Abstract
The implication problem for database constraints is central in the fields of automated schema
design and query optimization and has been traditionally approached with resolution-based
techniques. We present a novel approach to database constraints, using equations instead of Horn
clauses. This formulation enables us to use new techniques for database theory, which derive from
universal algebra, equational logic and lattice theory. It also points to the possibility of employing
theorem-proving techniques originally developed for equational theories to deal with implication in
the context of logical databases.

We apply our approach to study functional and inclusion dependencies. These constraints can model 
functional determination and data duplication and they have been extensively proposed as a 
powerful and realistic feature for semantic data models. We prove completeness of new proof 
procedures and we derive new upper and lower bounds for the complexity of various implication 
problems involving these dependencies. 

We also present a new class of constraints which are defined equationally, using algebraic operations 
on set-theoretic partitions. These partition dependencies provide an elegant generalization of
functional dependencies (in the direction of incorporating transitive closure), for which the
implication problem remains efficiently solvable.


Chapter One


Introduction 


1.1 Functional and Inclusion Dependencies in the Relational Model 


The development of the relational data model [21, 22] led to major progress in the area of database
management. The model and its implementations have contributed significantly both to the increase
of programmer productivity [23] and to the fundamental understanding of computation [62].


Among the advantages of the model, which account for its success, are [23]: 
1. The sharp, clear boundary it provides between the conceptual and the physical aspects of database 
management. 
2. Its simplicity, which allows users and programmers to have a common understanding of the data 
and therefore communicate easily about it. 
3. The introduction of truly high level language concepts, which enables users to express operations 
on large pieces of information, without detailed knowledge of its representation or of the access paths 
to where it is stored. 
4. A sound, mathematical foundation, which makes possible the theoretical study of the (often
formidable) problems of database design and manipulation.


The relational data model consists of a structural part (with a unique data type, the relation), a 
manipulative part (with powerful algebraic operators such as selection, projection and join) and an 
integrity part (constraints defining consistent database states, intended to capture the semantics of 
particular applications) [62, 51]. A relation is a table with columns named by attributes and with rows 
containing values from some domain, each row being a tuple. A database is a finite set of relations. A 
logical database or database schema consists of a database scheme, i.e. a finite set D of relation
schemes (sequences of attributes naming the columns of relations), along with a finite set Σ of
integrity constraints (dependencies), which should be satisfied by all legal physical databases (database
instances).


For an example (invariant throughout the database literature), consider a database of two relations
R, S, where R has attributes EMPLOYEE and MANAGER and S has attributes MANAGER and
DEPARTMENT. If we take as our semantic restrictions that “every employee has exactly one manager”
and “every manager manages exactly one department”, we define the following database schema:
D = {R[EMPLOYEE, MANAGER], S[MANAGER, DEPARTMENT]}
Σ = {R:EMPLOYEE→MANAGER, S:MANAGER→DEPARTMENT}

In this case, our constraints are examples of functional dependencies [21, 22, 62, 51]. Formally, a 
functional dependency (FD) is an assertion of the form R:X→Y, where R is the name of a relation
and X, Y are sets of attributes from the relation scheme of R. It is satisfied by a database instance iff
whenever two tuples of relation R agree on all attributes appearing in X, they also agree on all 
attributes appearing in Y. Observe that, with no loss of generality, we can take Y to consist of a single 
attribute. 
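
The satisfaction condition for an FD can be checked directly on a finite relation. The following is a minimal sketch, assuming a relation is represented as a list of dictionaries from attribute names to values; the function name `satisfies_fd` and the sample rows are hypothetical illustrations, not part of the original development.

```python
def satisfies_fd(relation, lhs, rhs):
    """Return True iff any two tuples agreeing on all of lhs also agree on all of rhs."""
    seen = {}  # maps each lhs-projection to the rhs-projection first seen with it
    for row in relation:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical instance of relation R[EMPLOYEE, MANAGER].
R = [
    {"EMPLOYEE": "ann", "MANAGER": "carol"},
    {"EMPLOYEE": "bob", "MANAGER": "carol"},
]
# Adding a second manager for "ann" violates EMPLOYEE -> MANAGER.
R_bad = R + [{"EMPLOYEE": "ann", "MANAGER": "dave"}]
```

On this data, `satisfies_fd(R, ["EMPLOYEE"], ["MANAGER"])` holds, while the same check on `R_bad` fails.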

Functional dependencies form a conceptually simple and naturally occurring class of constraints. For
this reason, they have been extensively studied in the literature (see [7, 62, 51] for reviews of the area).
Combined with the algebraic operators of the relational model they provide a practical and elegant
approach to the problems of database design and manipulation.


At present, a major research effort is underway towards extending the relational model. This effort
is motivated in large part by the success of the relational methodology and by the demands of specific
application domains, in particular Office Automation (see, e.g., [20, 24, 37, 42, 59, 61], which is by no
means an exhaustive list). The approach generally taken is to appropriately enrich the integrity part
by adding constraints which will enhance the expressive power of the model, while at the same time
preserving its original advantages.


Returning to our example, suppose we also want to be able to express simple facts such as
“everyone who manages employees belongs to some department”. In other words, we want to add to
the semantics of our relations that a MANAGER entry in relation R must also appear as a MANAGER
entry in relation S. This constraint is formally captured by the inclusion dependency [16]
R:MANAGER⊆S:MANAGER. In general, an inclusion dependency (IND) is a statement of the form
R:A_1...A_m⊆S:B_1...B_m. Such a statement is satisfied by a database instance iff whenever a tuple with
entries a_1,...,a_m for attributes A_1,...,A_m appears in relation R, a tuple with entries a_1,...,a_m for
attributes B_1,...,B_m appears in relation S.
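
The satisfaction condition for an IND can likewise be tested on finite relations. The sketch below assumes the same list-of-dictionaries representation of relations; the name `satisfies_ind` and the sample data are hypothetical.

```python
def satisfies_ind(r, lhs, s, rhs):
    """Check R:lhs ⊆ S:rhs, i.e. each lhs-projection of a tuple of r
    occurs as the rhs-projection of some tuple of s."""
    if len(lhs) != len(rhs):
        raise ValueError("both sides of an IND must list the same number of attributes")
    available = {tuple(row[b] for b in rhs) for row in s}
    return all(tuple(row[a] for a in lhs) in available for row in r)


# Hypothetical instances of R[EMPLOYEE, MANAGER] and S[MANAGER, DEPARTMENT].
R = [{"EMPLOYEE": "ann", "MANAGER": "carol"}]
S = [{"MANAGER": "carol", "DEPARTMENT": "sales"}]
S_missing = [{"MANAGER": "dave", "DEPARTMENT": "ops"}]
```

Here `satisfies_ind(R, ["MANAGER"], S, ["MANAGER"])` holds, whereas replacing `S` by `S_missing` violates the dependency because the manager of `ann` has no department.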


Inclusion dependencies make it possible to selectively define what data must be duplicated in
what relations and thus they provide a valuable tool for database design [24, 59, 69]. The central
notion of referential integrity [24, 29] can be expressed using IND’s. Together with FD’s, IND’s form
the basis of the structural model of [67]. Descriptions of logical databases written in a variety of
languages can be translated into a common language which uses relations, FD’s and IND’s [45].
Inclusion dependencies have also been employed to map an entity-relationship schema to the
relational model [20]. We mention in passing that IND’s have been commonly known in Artificial
Intelligence applications as ISA relationships (cf. [9]).


Although the addition of IND’s to the relational model has been recognized as realistic and
desirable (because of their conceptual simplicity and expressive power), they have only
recently become the object of theoretical investigation [16, 43, 54, 19, 58, 17, 44, 48, 26]. General questions
relating to the implication problem for IND’s and FD’s have been studied in [16, 54, 19]. A rather
surprising result [54, 19] is that the combination of IND’s with FD’s is as powerful computationally as
first-order predicate calculus. This result can be considered both positive (as it hints at the possibly
rich potential of two simple primitive forms) and negative, as it implies inherent computational
intractability of the general case. From a more practical standpoint, [43, 17, 44, 26] provide solutions
to database design and query optimization problems in the presence of (suitably restricted) IND’s
and FD’s. Also, central notions such as the Universal Instance Assumption [62, 51] have been
investigated using IND’s [58, 48]. We will review the theoretical work on IND’s in more detail in the
sequel.


1.2 The Implication Problem 


The (unrestricted) implication problem for a class of dependencies is the following: Given a finite
set Σ of dependencies and a dependency σ, test if σ holds in all (not necessarily finite) databases
which satisfy the dependencies in Σ. By restricting attention to finite databases, we obtain the finite
implication problem.


Solving the implication problem is the main computational task associated with a class of
dependencies. As a rule, algorithmic approaches to database schema design and query optimization
are based on efficient solutions of the implication problem (see, e.g., [12, 6, 3, 18, 62, 51]). Evidently,
if we are concerned with applications then the finite implication problem is the one which is most
relevant. However, it tends to be much more difficult to deal with. Moreover, for the classes of
dependencies for which implication is decidable, it generally happens that finite implication
coincides with unrestricted implication.


The problem of dependency implication can be approached in a very general setting by
formulating dependencies as sentences in first-order logic, namely as Horn clauses [34] (see Section
5.1 of this thesis for some examples). Closely related to this approach is a particular proof procedure,
the chase; see [52, 11, 62, 51] for its wide applicability (proof procedures for general dependencies
also appear in [10, 68, 57]). It has been observed that the chase is a special case of a classical theorem
proving technique, namely resolution [10, 11]. The chase provides straightforward algorithms for
implication of classes of dependencies for which it can be shown to terminate. Furthermore, in these
cases the chase produces a finite counterexample whenever implication does not hold; it is for this
reason that finite implication coincides with unrestricted implication in these cases.


Returning now to functional and inclusion dependencies, what appears to be the fundamental
difficulty is precisely that IND’s can prevent the chase from terminating. Of course, in the case of
general FD’s and IND’s one cannot hope to circumvent this obstacle, since the implication problem
is undecidable [54, 19]. Nevertheless, given the practical importance of these dependencies it makes
sense to study the complexity of special cases. The obvious approach that has been suggested is to
analyze the chase, but this turns out to be a very delicate task (cf. [43]), which can only give partial
results [43, 26]. Thus, it seems that new tools are required in order to make major progress.


The main contribution of this thesis is the introduction of such tools, borrowed from equational 
logic. This is a fragment of first-order logic which has attracted a lot of attention, because of its 
relevance to areas such as applicative languages, interpreters and data types (see [41] for a survey). 
However, it does not seem to have been noticed by the database theory community, since a constant 
effort has been made to minimize the role of equality in dependencies (multivalued dependencies 
(MVD's) [62, 51], the most widely studied after FD’s, do not involve equality). The only case where 
ideas from equational logic were applied in database theory seems to be the best algorithm for 
losslessness of joins (a basic computational problem), which was derived from an efficient algorithm 
for congruence closure [31]. Also, the best algorithm for implication of FD’s [6] can be seen directly
(as we observe) as a special case of an algorithm of [47] for the generator problem in finitely presented
algebras.


We use the methods of equational logic to formulate and study implication problems involving
FD’s and IND’s. We also use equations to define a new class of dependencies (generalizing FD’s) and
to investigate its implication problem. In the subsequent Sections, we review in more detail the
content of each Chapter.


1.3 Chapter Two: The Equational Approach to Dependencies 


Let r be a relation over a set of attributes U, with values taken from a domain 𝒟. Suppose r
satisfies the FD AB→C, i.e. whenever two tuples of r agree on A,B they also agree on C (here and in
the sequel we consider single relations, so we can suppress relation names from dependencies). Let x
be a variable ranging over the tuples of r and let a(x) (b(x), c(x)) be a function which assigns to a tuple
x the entry of x at attribute A (B,C). Now since r satisfies AB→C, it is easy to see that there is a
function f (from 𝒟² to 𝒟) such that the following sentence is true in r:


∀x. f(a(x), b(x)) = c(x)


This observation suggests the following syntactic transformation: the FD AB→C is rewritten as
an equation
faxbx = cx,
where now the symbol a (b,c) is a function symbol of ARITY 1 representing the attribute A (B,C) and f
is a function symbol of ARITY 2 corresponding to the FD. Using the standard convention of
equational logic, we omit the universal quantifier on the variable x.
We now illustrate how this equational formalism can be used to infer FD’s. 


Example 1.1: Given the FD’s
A→B_1, A→B_2, B_1B_2→C
we can infer the FD A→C. Using our transformation, the given set of FD’s produces the equations
f_1ax = b_1x, f_2ax = b_2x, gb_1xb_2x = cx.
From these we can infer the equation


gf_1axf_2ax = cx.


In general, we can infer an FD such as A→C if we can infer an equation τ[x/ax]=cx, where τ is a
term over the f’s and a variable x (in Example 1.1, τ is the term gf_1xf_2x). The notation τ[x/ax] means
that we substitute ax for x in τ.
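
The semantic reading of Example 1.1 can be illustrated on a concrete relation: each FD yields a function (here a lookup table), and composing these functions reproduces the C-entry of every tuple from its A-entry. This is a sketch under stated assumptions; the helper `fd_function`, the attribute names and the sample tuples are hypothetical, and the code only evaluates the composed term on one finite relation rather than performing equational inference.

```python
def fd_function(relation, lhs, rhs):
    """Build the function witnessing the FD lhs -> rhs as a lookup table.
    Raises ValueError if the relation violates the FD."""
    table = {}
    for row in relation:
        key = tuple(row[a] for a in lhs)
        if table.setdefault(key, row[rhs]) != row[rhs]:
            raise ValueError(f"FD {lhs} -> {rhs} is violated")
    return table


# A relation satisfying A -> B1, A -> B2 and B1 B2 -> C.
r = [
    {"A": 1, "B1": "p", "B2": "q", "C": 10},
    {"A": 2, "B1": "p", "B2": "q", "C": 10},
    {"A": 3, "B1": "p", "B2": "s", "C": 30},
]
f1 = fd_function(r, ["A"], "B1")
f2 = fd_function(r, ["A"], "B2")
g = fd_function(r, ["B1", "B2"], "C")


def tau(x):
    """Evaluate the composed term g(f1(a(x)), f2(a(x))) on tuple x."""
    a = (x["A"],)
    return g[(f1[a], f2[a])]


# The composed term agrees with c(x) on every tuple, so A -> C holds in r.
composed_ok = all(tau(x) == x["C"] for x in r)
```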


Interestingly, this equational formulation can be extended to IND’s as well. Suppose relation r
satisfies the IND A_1A_2⊆B_1B_2, i.e. for each tuple t of r there is a tuple t’ of r such that the values of t’
on B_1,B_2 are the same as the values of t on A_1,A_2 respectively. This means the following sentence is
true in r:

∀x ∃y. [b_1(y) = a_1(x) ∧ b_2(y) = a_2(x)]
(as before, x,y are variables ranging over the tuples of r and a_1,a_2,b_1,b_2 are functions corresponding to
the attributes A_1,A_2,B_1,B_2).
Consider now the Skolemization of the existential quantifier ∃y: one obtains the sentence

∀x. [b_1(i(x)) = a_1(x) ∧ b_2(i(x)) = a_2(x)],
which is true in r for some suitable function i(x) (from tuples to tuples). This suggests transforming
the IND A_1A_2⊆B_1B_2 into the set of equations

b_1ix = a_1x, b_2ix = a_2x


(here i is a function symbol of ARITY 1 corresponding to the IND).
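
For a finite relation, a Skolem function of this kind can be constructed explicitly by choosing one witness tuple for each tuple. The following sketch assumes a relation given as a list of dictionaries and represents the function as a map between tuple positions; the name `skolem_function` and the sample relation are hypothetical, and infinite relations (which the text allows) are outside its scope.

```python
def skolem_function(relation, lhs, rhs):
    """For the IND lhs ⊆ rhs within one relation, return a dict i mapping each
    tuple position x to a position i(x) whose rhs-values equal x's lhs-values.
    Returns None if the relation violates the IND."""
    first_with = {}
    for pos, row in enumerate(relation):
        first_with.setdefault(tuple(row[b] for b in rhs), pos)
    i = {}
    for pos, row in enumerate(relation):
        key = tuple(row[a] for a in lhs)
        if key not in first_with:
            return None
        i[pos] = first_with[key]
    return i


# A relation satisfying A1 A2 ⊆ B1 B2.
r = [
    {"A1": 1, "A2": 2, "B1": 3, "B2": 4},
    {"A1": 3, "A2": 4, "B1": 1, "B2": 2},
]
i = skolem_function(r, ["A1", "A2"], ["B1", "B2"])

# The two equations b1(i(x)) = a1(x) and b2(i(x)) = a2(x) hold for every tuple x.
equations_hold = all(
    r[i[x]]["B1"] == r[x]["A1"] and r[i[x]]["B2"] == r[x]["A2"] for x in range(len(r))
)
```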


Example 1.2: From the dependencies
A_1A_2⊆B_1B_2, A_2A_3⊆B_2B_3, B_2→B_3
we can infer the IND A_1A_2A_3⊆B_1B_2B_3 [16, 54]. Using our transformation, the given set of
dependencies produces the equations


b_1ix = a_1x, b_2ix = a_2x,
b_2jx = a_2x, b_3jx = a_3x,
fb_2x = b_3x.


From these we can infer

b_3ix = fb_2ix = fa_2x = fb_2jx = b_3jx = a_3x,
i.e. we can infer the set of equations

b_1ix = a_1x, b_2ix = a_2x, b_3ix = a_3x.

In general, we can infer an IND such as A_1A_2A_3⊆B_1B_2B_3 if we can infer a set of equations
b_1τ = a_1x, b_2τ = a_2x, b_3τ = a_3x, where τ is some term over the i’s and a variable x (in Example 1.2, τ is
simply ix).

Thus, we can use equational reasoning to obtain a proof procedure for FD’s and IND’s. The
soundness and completeness of this approach is demonstrated in Theorem 2.1. As a matter of fact, the
soundness part (whenever an equation of the appropriate form is implied, the corresponding
dependency is implied) is easy and it should already be plausible from the preceding discussion. The
difficult part is completeness (whenever a dependency is implied, an equation of the appropriate
form is implied). This is proved by a rather delicate induction, which shows that equational reasoning
can simulate the chase.


We can also have a slightly different syntactic transformation of dependencies into equations. This
transformation, however, does not have a straightforward semantic justification.


Consider the FD’s in Example 1.1: We can transform them into the equations
f_1a = b_1, f_2a = b_2, gb_1b_2 = c,
from which we can infer the equation
gf_1af_2a = c.


The symbols a,b_1,b_2,c are now constant symbols representing the attributes A,B_1,B_2,C.


When approached this way, the implication problem for FD’s becomes a special case of the
generator problem for finitely presented algebras [47], for which [47] gives a polynomial-time
algorithm. By inspecting the behaviour of [47]’s algorithm in this special case, we obtain the
linear-time algorithm for implication of FD’s given in [6].
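
In the constant-symbol reading, testing whether an FD is implied amounts to asking which attribute constants can be generated from the left-hand side by applying the function symbols of the given FD’s. The sketch below computes this set by a naive fixpoint iteration; it is a simpler, quadratic-time stand-in and is not the linear-time algorithm of [6] nor the algorithm of [47]. The names `closure` and `implies_fd` and the encoding of FD’s as (left-hand side, attribute) pairs are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attributes generated from attrs using fds, a list of (lhs, rhs) pairs
    where lhs is a tuple of attribute names and rhs is a single attribute."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and rhs not in closed:
                closed.add(rhs)
                changed = True
    return closed


def implies_fd(fds, lhs, rhs):
    """Test whether the FD lhs -> rhs follows from fds."""
    return rhs in closure(lhs, fds)


# The FD's of Example 1.1.
fds = [(("A",), "B1"), (("A",), "B2"), (("B1", "B2"), "C")]
```

With these FD’s, `implies_fd(fds, ("A",), "C")` holds, matching the inference in Example 1.1, while `implies_fd(fds, ("B1",), "C")` does not.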


This alternative transformation can also be extended to IND’s. We transform the IND
A_1A_2⊆B_1B_2 into the set of equations
ib_1 = a_1, ib_2 = a_2.
Observe that we have now eliminated the variable x, which can play an essential role when IND’s are
combined with FD’s (cf. Example 1.2). For this reason we also need equations of the form
fix = ifx,
which permit us to move the f’s over the i’s and vice versa. The soundness and completeness of this
approach is also proved in Theorem 2.1.


The equational formulation of dependencies is more redundant than the standard one, since we
need to introduce new symbols (f’s and i’s). On the other hand, inferences of dependencies now give
us more information: whenever we infer a dependency σ from a set of dependencies Σ, the
associated term τ (cf. Examples 1.1, 1.2) tells us how σ results (in any database satisfying Σ) by
“composing” dependencies in Σ.


In the remainder of Chapter 2, we use our equational approach to prove several results relating to
FD and IND implication. We first give a new proof procedure for FD’s and IND’s (Theorem 2.2).
This proof procedure is different in spirit both from the chase and the proof procedure of [54] and it
treats FD’s and IND’s in a symmetric fashion. The equational tools come into play in the proof of
completeness of this proof procedure. Usually, completeness is proved by constructing a database
which satisfies a set of dependencies Σ but violates a dependency σ (assuming σ cannot be proved
from Σ); see, e.g., [11, 54, 62]. In our case, we consider the set of equations E_Σ obtained from Σ and
we construct an algebra which satisfies E_Σ but violates any equation that could correspond to σ.


Our second result is a precise characterization of the complexity of acyclic IND’s and FD’s.
Intuitively, a set of IND’s is acyclic [58] if it does not contain any cycles of inclusions, such as
{R:A_1A_2⊆R:B_1B_2}, {R:A⊆S:B, S:B’⊆R:A’} and so on. Acyclic sets of IND’s have been proposed
as a useful tool for database schema design [58]. One can easily observe that the implication problem
for acyclic IND’s and FD’s can be solved in exponential time (the chase terminates in this case).
NP-hardness lower bounds for the problem were obtained in [26].


We show that the implication problem for acyclic IND’s and FD’s requires exponential time
(Theorem 2.4). The main observation is that, when all FD’s are unary (i.e. the left-hand side contains
a single attribute), the equational inferences of Examples 1.1, 1.2 can be viewed as inferences in
semigroups (Corollary 2.3). Such inferences can in turn simulate computations of an automaton with
two pushdown stores. Since such automata are universal computing devices, we obtain a tight
undecidability result for FD and IND implication (Theorem 2.3). Furthermore, the acyclicity
condition on the IND’s corresponds to bounding the size of one of the pushdown stores, which gives
us exponential time.


1.4 Chapter Three: Application to Typed IND’s 


A usual assumption in database theory is that all database relations are projections of a single 
universal relation (Universal Instance Assumption [62, 51]). In practice this is not always the case, so 
one has the problem of testing the existence of a universal instance and the problem of adjusting the 
database relations to maintain the existence of a universal instance as the database is updated. Both of 
these problems are known to be NP-complete [39]. An alternative, weaker condition we may impose
on a multi-relational database is pairwise consistency, i.e. every pair of the database relations is
required to have a universal relation. This condition is easy to test and maintain, as described in
numerous works on the subject (see [8] for a review). In fact, if the database scheme is acyclic [8] then
pairwise consistency implies the existence of a universal instance.


Most of the theoretical work on dependencies is done in the context of databases consisting of a
single relation, i.e. it assumes the existence of a universal instance [62, 51]. A natural question, then,
is to investigate the effect of the weaker assumption of pairwise consistency on the implication
problem, say for functional dependencies. Although the implication problem for FD’s is solvable in
linear time assuming a universal instance [6], it is not clear even if it is decidable in the context of
pairwise consistency.


Let r_1,r_2 be relations over relation schemes R_1[U_1], R_2[U_2] respectively. It is not difficult to see
that r_1,r_2 have a universal instance iff the projection of r_1 on U_1∩U_2 is the same as the projection of
r_2 on U_1∩U_2 [1]. This can be expressed (with a slight abuse of notation) by the pair of IND’s


R_1:U_1∩U_2 ⊆ R_2:U_1∩U_2
R_2:U_1∩U_2 ⊆ R_1:U_1∩U_2.


These are examples of typed IND’s. An IND is typed [17, 48] if it has the form R:A_1...A_m⊆S:A_1...A_m.
By the above observation, we can then formulate the implication problem for FD’s in the presence of
pairwise consistency as an implication problem for FD’s and (typed) IND’s.
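
The projection criterion above translates directly into a test on two finite relations. The sketch assumes relations given as lists of dictionaries together with their attribute lists; the function names `project` and `pairwise_consistent` and the sample data are hypothetical.

```python
def project(relation, attrs):
    """Set of projections of the tuples of relation on attrs."""
    return {tuple(row[a] for a in attrs) for row in relation}


def pairwise_consistent(r1, u1, r2, u2):
    """True iff r1 and r2 have the same projection on their shared attributes,
    i.e. both typed IND's on the shared attributes hold."""
    shared = sorted(set(u1) & set(u2))
    return project(r1, shared) == project(r2, shared)


U1 = ["EMPLOYEE", "MANAGER"]
U2 = ["MANAGER", "DEPARTMENT"]
r1 = [{"EMPLOYEE": "ann", "MANAGER": "carol"}]
r2 = [{"MANAGER": "carol", "DEPARTMENT": "sales"}]
# A manager who appears in r2 only breaks the second typed IND.
r2_extra = r2 + [{"MANAGER": "dave", "DEPARTMENT": "ops"}]
```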


In this Chapter, we apply the equational techniques of Chapter 2 to study the implication problem 
for FD’s and typed IND’s. The main tool we develop is a proof procedure for general FD’s and IND’s 
(Theorem 3.1). This proof procedure is different from the procedure of Theorem 2.2 and somewhat 
reminiscent in spirit of the axiomatization of [54]. We prove completeness of the procedure by
showing that it captures (indirectly) equational inferences as in Examples 1.1, 1.2.


By analyzing the behaviour of this proof procedure in the case of typed IND’s, we obtain a
decidability result for typed IND’s and FD’s satisfying an acyclicity condition (Corollary 3.1). We
then further specialize the proof procedure to the case of unary FD’s in the presence of pairwise
consistency (Lemma 3.2). By a rather complicated analysis of derivations, we show that this
implication problem is undecidable (Theorem 3.3). This provides a very tight undecidable case of FD
and IND implication.


Finally, we use Lemma 3.2 to show that there is no k-ary axiomatization (involving only FD’s and
IND’s) for implication of unary FD’s under pairwise consistency (Theorem 3.4; the technical notion
of a k-ary axiomatization is explained in Chapter 3). This strengthens a previous result of [16] about
non-existence of k-ary axiomatizations for FD’s and IND’s.


1.5 Chapter Four: Finite Implication of FD’s and Unary IND’s 


Given the importance of the finite implication problem, it is natural to ask if our equational
approach can be extended to finite implication. Unfortunately, there are difficulties. The
completeness part of Theorem 2.1 is proved by analyzing a proof procedure (the chase). However, in
the case of finite implication of FD’s and IND’s such a proof procedure does not even exist [54, 19].


Nevertheless, we can have a complete proof procedure for finite implication of FD’s and IND’s, if
we restrict ourselves to IND’s with one attribute per side (unary IND’s). Unrestricted implication
becomes rather uninteresting in this case, because FD’s and unary IND’s do not interact in any
non-trivial way (Proposition 4.1). However, in the finite case we have the following interaction:


from A_0→A_1 and A_1⊇A_2 and ... and A_{m-1}→A_m and A_m⊇A_0
derive A_1→A_0 and A_2⊇A_1 and ... and A_m→A_{m-1} and A_0⊇A_m


(m odd).
It turns out that this is the only non-trivial interaction: by turning the above observation into a set of
inference rules (one for each odd m) and including the usual inference rules for FD’s [5] and IND’s
[16], we obtain a complete axiomatization for FD’s and unary IND’s in the finite case (Theorem 4.1).
The completeness proof is rather long and it involves an intricate construction of a finite
counterexample relation. We also remark that this axiomatization leads to a polynomial-time
algorithm for finite implication of FD’s and unary IND’s [44]. The class of FD’s and unary IND’s is
the only known class of dependencies for which unrestricted and finite implication are both solvable
without being identical.
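
The smallest instance of this interaction (m=1) says that, in a finite relation with two columns, if column 0 functionally determines column 1 and every value of column 0 also occurs in column 1, then column 1 functionally determines column 0 and every value of column 1 occurs in column 0. The sketch below checks this exhaustively over all nonempty relations with values drawn from a three-element domain; it is a sanity check on small cases, not a proof, and the helper names `fd` and `ind` are hypothetical.

```python
from itertools import combinations, product


def fd(rel, x, y):
    """Column x functionally determines column y in rel (a collection of pairs)."""
    seen = {}
    return all(seen.setdefault(t[x], t[y]) == t[y] for t in rel)


def ind(rel, x, y):
    """Every value occurring in column x also occurs in column y."""
    return {t[x] for t in rel} <= {t[y] for t in rel}


tuples = list(product(range(3), repeat=2))  # all pairs over the domain {0, 1, 2}
counterexamples = []
for size in range(1, len(tuples) + 1):
    for rel in combinations(tuples, size):
        premises = fd(rel, 0, 1) and ind(rel, 0, 1)
        conclusions = fd(rel, 1, 0) and ind(rel, 1, 0)
        if premises and not conclusions:
            counterexamples.append(rel)
```

No counterexample is found among these finite relations. Finiteness is essential: the infinite relation consisting of all pairs (n+1, n) for n ≥ 0 satisfies both premises, yet the value 0 occurs in column 1 and not in column 0.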


Interestingly, the above axiomatization can also be used to prove an analogue of Theorem 2.1 for
finite implication of FD’s and unary IND’s (Theorem 4.2). However, this result is weaker, in the
following way. Suppose, for example, that we want to test if the FD A→B is implied from a set of
dependencies Σ. In the unrestricted case we can show that, if A→B is implied, then there is a term τ
such that the equation τ[x/ax]=bx is implied (cf. Example 1.1); i.e., τ[x/ax]=bx holds in all algebras
which satisfy the equations corresponding to Σ. In the finite case, we can only show that, for each
algebra A as above, there is a term τ (depending on A) such that the equation τ[x/ax]=bx holds in
A.


1.6 Chapter Five: Partition Dependencies


We have presented in Chapter 2 an equational formulation of functional dependencies. One can
also have another formulation of quite different flavor, using algebraic operations on partitions (this
seems to be a folklore observation, see e.g. [15, 60]).


Specifically, let r be a relation and for each attribute A let π_A be the following partition of the set
of tuples of r: tuples t, s are in the same block of π_A iff they agree on attribute A. Now it is easy to see
that r satisfies the FD A→B iff

π_A ≤ π_B,
or, equivalently,


π_A = π_A * π_B,
π_B = π_A + π_B.


Here ≤ is the usual refines relation and *,+ are the usual product and sum operations on partitions.
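
The partition formulation can be tried out on a small relation. In the sketch below, a partition is a set of frozensets of tuple positions, the product is the coarsest common refinement and the sum is the finest common coarsening (obtained by transitively merging overlapping blocks). The function names and the sample relation are hypothetical, and the code assumes finite relations given as lists of dictionaries.

```python
def partition(relation, attr):
    """Partition of tuple positions: two tuples share a block iff they agree on attr."""
    blocks = {}
    for pos, row in enumerate(relation):
        blocks.setdefault(row[attr], set()).add(pos)
    return {frozenset(b) for b in blocks.values()}


def refines(p, q):
    """p refines q: every block of p is contained in some block of q."""
    return all(any(b <= c for c in q) for b in p)


def product(p, q):
    """Product of partitions: the nonempty pairwise intersections of blocks."""
    return {b & c for b in p for c in q if b & c}


def sum_(p, q):
    """Sum of partitions: merge blocks of p and q that overlap, transitively."""
    merged = []  # invariant: pairwise disjoint blocks
    for block in p | q:
        current = set(block)
        for m in [m for m in merged if m & current]:
            current |= m
            merged.remove(m)
        merged.append(current)
    return {frozenset(b) for b in merged}


r = [{"A": 1, "B": "x"}, {"A": 2, "B": "x"}, {"A": 3, "B": "y"}]
pi_a, pi_b = partition(r, "A"), partition(r, "B")
```

In this relation the FD from A to B holds, and correspondingly `refines(pi_a, pi_b)`, `product(pi_a, pi_b) == pi_a` and `sum_(pi_a, pi_b) == pi_b` all hold, while `refines(pi_b, pi_a)` fails.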


We are thus led to consider general equations over *,+ and the π_A’s. We call such equations
partition dependencies (PD’s) [27].


We first compare the expressive power of PD’s to that of previously studied database constraints,
namely embedded implicational dependencies (EID’s) [34]. A first observation is that PD’s of the form
π_A = π_B + π_C can express symmetric transitive closure (Example 5.2). It follows by a simple
compactness argument that such PD’s cannot be expressed by any set of EID’s (Theorem 5.1). On
the other hand, PD’s are unable to detect complicated patterns of equalities in relations and for this
reason they cannot express, for instance, multivalued dependencies (Theorem 5.2).


We then study the implication problem for PD’s. We observe that the (finite) implication problem
for PD’s is equivalent to the uniform word problem for (finite) lattices (Lemma 5.1). This follows
from two deep results of lattice theory, namely that (finite) equivalence relations can represent
arbitrary (finite) lattices [66, 56]. Using techniques from universal algebra [36, 47] and lattice theory
[28], we show that these word problems are equivalent and they can be solved in polynomial time
(Theorem 5.3).


Finally, we examine the problem of testing consistency [38, 64] of a database with a set of PD’s.
Using our polynomial-time algorithm for implication, we show that it can be reduced to testing
consistency with a set of FD’s [38]. It follows that the problem can be solved in polynomial time
(Theorem 5.4).


result appeared in [27]. Theorems 5.1, 5.2, 5.4 were obtained jointly, and appeared also in [27].


The extension to general dependencies outlined in the concluding chapter is due to the author of
this thesis.


Chapter Two


The Equational Approach to Dependencies 


We present in this Chapter the equational formalization of functional and inclusion dependencies.
Section 2.1 gives the necessary definitions and background from database theory and equational
logic. In Section 2.2 we present the main Theorem and its Corollaries. We use it in Section 2.3 to
prove completeness of a new proof procedure for FD’s and IND’s. In Section 2.4 we apply the
equational formulation to prove new lower bounds for FD and IND implication.


2.1 Definitions 


2.1.1 Relational Database Theory 


Let 𝒰 be a finite set of attributes and 𝒟 a countably infinite set of values, such that 𝒰∩𝒟=∅. A
relation scheme is an object R[U], where R is the name of the relation scheme and U⊆𝒰. A tuple t
over U is a function from U to 𝒟. Let U={A_1,...,A_n} and a_k a value, k=1,...,n; if t(A_k)=a_k, we
represent tuple t over U as a_1a_2...a_n. We represent the restriction of tuple t on a subset X of U as t[X].
A relation r over U (named R) is a (possibly infinite) nonempty set of tuples over U. A database
scheme D is a finite set of relation schemes {R_1[U_1],...,R_q[U_q]} and a database d={r_1,...,r_q} associates
each relation scheme R_i[U_i] in D with a relation r_i over U_i. A database is finite if all of its relations
are finite. A database can be visualized as a set of tables, one for each relation, whose headers are the
relation schemes (each column headed by an attribute) and whose rows are the tuples.


The logical constraints which determine the set of legal databases are called database dependencies
[62, 51]. We will be examining two very common types of dependencies.


FD R:A_1...A_n→A (n>0) is a functional dependency [62, 51].
Relation r (named R) satisfies this FD iff,
for tuples t_1, t_2 in r, t_1[A_1...A_n]=t_2[A_1...A_n] implies t_1[A]=t_2[A].
If n=1, i.e. the left-hand side contains a single attribute, we have a unary functional dependency
(u-FD).


IND S:D_1...D_m⊆R:C_1...C_m (m>0) is an inclusion dependency [16].
Relations s,r (named S,R respectively) satisfy this IND iff,
for each tuple t in s, there is a tuple t’ in r with t’[C_k]=t[D_k], k=1,...,m.
If m=1, we have a unary inclusion dependency (u-IND).


Equality of two columns headed by attributes A,B in a relation named R can be expressed as a
special case of IND’s: Use an IND such as R:AB⊆R:AA. These dependencies are particularly
illustrative of our analysis; we will use A=B to denote them.


Database Notation: We use a graph notation to represent an input database scheme D and a set of
dependencies Σ (input schema). We construct a labeled directed graph G_Σ (see Figure 2-1), which has
exactly one node a^i_k for each attribute A_k of each relation scheme R_i. For each IND
R_i:D_1...D_m⊆R_j:C_1...C_m in Σ, the graph G_Σ contains m black arcs (c^j_1,d^i_1),...,(c^j_m,d^i_m); each arc is
labeled by the name i of the IND. For each FD R_i:A_1...A_n→A in Σ, the graph G_Σ contains a group
of n red arcs (a^i_1,a^i),...,(a^i_n,a^i); the group is labeled by the name f of the FD and its arcs are ordered
from 1 to n as listed above.


We also construct two directed graphs I_Σ and F_Σ (see Figure 2-1): The graph I_Σ has one node for
each relation scheme name in D and arc (R_i,R_j) iff G_Σ contains some black arc (a^i,b^j). The graph F_Σ
has one node a for each attribute A of D and arc (a,b) iff G_Σ contains some red arc (a^k,b^k). We now
define special syntactically restricted forms of input schemata:


Acyclic IND’s: I_Σ is acyclic [58].

Acyclic FD’s: F_Σ is acyclic.

Typed IND’s: The black arcs of G_Σ are all of the form (a^i,a^j) for relation names R_i,R_j and attribute
A [17, 48].
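
The acyclicity condition on IND’s is an ordinary cycle test on a directed graph whose nodes are relation names. The sketch below assumes each IND is summarized by the pair of relation names it connects (the attribute lists play no role in this test, and neither does the orientation chosen for the arcs); the function name `inds_are_acyclic` is hypothetical.

```python
def inds_are_acyclic(inds):
    """inds: iterable of (relation_name, relation_name) pairs, one per IND.
    Returns True iff the induced graph on relation names has no directed cycle
    (a self-loop such as ("R", "R") counts as a cycle)."""
    graph = {}
    for left, right in inds:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set())

    state = {}  # 1 = currently on the DFS path, 2 = fully explored

    def has_cycle(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1 or (nxt not in state and has_cycle(nxt)):
                return True
        state[node] = 2
        return False

    return not any(node not in state and has_cycle(node) for node in graph)
```

For instance, a single IND from R into S is acceptable, whereas an IND from R into R, or a pair of IND’s from R into S and from S into R, produces a cycle.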


Typed IND’s are between occurrences of the same attribute names in different relation schemes.
If we assume that all possible typed IND’s are in the input schema (i.e., with some abuse of notation
R:U∩U' ⊆ S:U∩U' for all relation schemes R[U], S[U'] in database scheme D), then we have
pairwise consistency PC(D) [48].
Implication: We say that Σ implies σ (Σ ⊨ σ) if, whenever a database d satisfies Σ, it also satisfies
σ. We say that Σ finitely implies σ (Σ ⊨_fin σ) if, whenever a finite database d satisfies Σ, it also
satisfies σ.
Clearly if Σ ⊨ σ (implication) then Σ ⊨_fin σ (finite implication), but the converse is not always true.

Deciding implication of dependencies is a central problem in database theory.

Since dependencies are sentences in first-order predicate calculus with equality, we have proof
procedures for the implication problem (we denote provability as Σ ⊢ σ). A proof procedure is sound
if whenever Σ ⊢ σ, we have Σ ⊨ σ; and complete if it is sound and whenever Σ ⊨ σ, we have Σ ⊢ σ.

The standard complete proof procedure for database dependencies is the chase [62, 11]. We now
present the chase for FD’s and IND’s (cf. [43]).


Chase: Given an input schema D, Σ and a dependency σ, construct a set of tables T, with D’s
relation schemes as headers. These tables are originally empty and will be filled with symbols from
the countably infinite set 𝒮. Whenever we insert a new row of symbols from 𝒮 in a table of T and we
do not specify some of the entries of this row, we assume that distinct symbols from 𝒮, which have
not yet appeared elsewhere in T, are used to fill these entries. We use t_k for the k-th row of table R
and t_k[X] for this row’s entries in the columns of attributes X.

The initial configuration of T depends on σ as follows:
(i) If σ is the FD R:A_1...A_n→A: insert rows t_1, t_2 with the only restriction that
t_1[A_k]=t_2[A_k], k=1,...,n.
(ii) If σ is the IND S:D_1...D_m ⊆ R:C_1...C_m: insert t_1.


Every dependency in Σ produces a rule, as follows:
If f is an FD in Σ the corresponding FD-rule is:
<Consider T a database over symbols in 𝒮. If T does not satisfy f, because two symbols x and y are
different, then replace y by x in T>.
If i is an IND R:X ⊆ S:Y in Σ the corresponding IND-rule is:
<Consider T a database over symbols in 𝒮. If T does not satisfy i, because some t[X] does not appear
in the table S as some t'[Y], then insert t' in S with t'[Y]=t[X]>.

We will say that Σ ⊢_chase σ, if there is a finite sequence of applications of the FD-rules and IND-rules
produced by Σ that transforms T’s initial configuration to a final configuration satisfying:
(i) If σ is an FD as above: t_1[A]=t_2[A].
(ii) If σ is an IND as above: for some j, t_1[D_k]=t_j[C_k], k=1,...,m.

Proposition 2.1: Σ ⊢_chase σ iff Σ ⊨ σ. ∎
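The chase described above can be sketched as a small bounded procedure. This is an illustration under stated assumptions, not the formulation of [43]: symbols are integers, INDs are assumed to have no repeated attributes, rules are applied in a fixed order, and a step limit `max_steps` (a parameter invented for this sketch) cuts off runs that may not terminate. All function and variable names are hypothetical.

```python
from itertools import count


def chase(schema, fds, inds, goal, max_steps=1000):
    """Bounded chase for FD's and IND's.

    schema: dict relation name -> list of attributes
    fds:    list of (rel, lhs_attrs, rhs_attr)
    inds:   list of (s_rel, xs, r_rel, ys), meaning S:X ⊆ R:Y
    goal:   ("FD", rel, lhs_attrs, rhs_attr) or ("IND", s_rel, ds, r_rel, cs)
    Returns True if the goal is derived, False if no rule applies any more,
    and None if max_steps is exhausted (no answer).
    """
    fresh = count()
    tables = {rel: [] for rel in schema}

    def new_row(rel, fixed=None):
        row = {a: next(fresh) for a in schema[rel]}  # unspecified entries are fresh
        row.update(fixed or {})
        tables[rel].append(row)
        return row

    if goal[0] == "FD":
        _, rel, lhs, rhs = goal
        t1 = new_row(rel)
        t2 = new_row(rel, {a: t1[a] for a in lhs})
        done = lambda: t1[rhs] == t2[rhs]
    else:
        _, s_rel, ds, r_rel, cs = goal
        t1 = new_row(s_rel)
        done = lambda: any(all(t[c] == t1[d] for d, c in zip(ds, cs))
                           for t in tables[r_rel])

    def fd_step():
        for rel, lhs, rhs in fds:
            for u in tables[rel]:
                for v in tables[rel]:
                    if all(u[a] == v[a] for a in lhs) and u[rhs] != v[rhs]:
                        x, y = u[rhs], v[rhs]
                        for tab in tables.values():  # replace y by x everywhere in T
                            for row in tab:
                                for a in row:
                                    if row[a] == y:
                                        row[a] = x
                        return True
        return False

    def ind_step():
        for s_rel, xs, r_rel, ys in inds:
            for t in tables[s_rel]:
                if not any(all(w[y] == t[x] for x, y in zip(xs, ys))
                           for w in tables[r_rel]):
                    new_row(r_rel, {y: t[x] for x, y in zip(xs, ys)})
                    return True
        return False

    for _ in range(max_steps):
        if done():
            return True
        if not (fd_step() or ind_step()):
            return False
    return None


# From S:C→D and R:AB ⊆ S:CD the chase derives R:A→B.
schema = {"R": ["A", "B"], "S": ["C", "D"]}
fds = [("S", ["C"], "D")]
inds = [("R", ["A", "B"], "S", ["C", "D"])]
print(chase(schema, fds, inds, ("FD", "R", ["A"], "B")))  # True
```

Returning None rather than False when the step limit is reached keeps the sketch honest: with cyclic IND’s the chase may run forever, so an exhausted budget says nothing about implication.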


An alternative proof procedure for FD’s and IND’s is provided by the axiomatization of [54]. If Σ
is a set of FD’s and IND’s and σ is an FD or IND, then Σ ⊨ σ iff σ can be proved from Σ using the
following rules (X,Y denote sets of attributes):
1. (reflexivity) R:A→A.
2. (augmentation) from R:X→A derive R:XY→A.
3. (transitivity) from R:X→A_k, k=1,...,n, and R:A_1...A_n→A, derive R:X→A.
4. (IND reflexivity) R:A_1...A_m ⊆ R:A_1...A_m.
5. (IND transitivity) from R_1:A_1...A_m ⊆ R_2:B_1...B_m and R_2:B_1...B_m ⊆ R_3:C_1...C_m, derive
R_1:A_1...A_m ⊆ R_3:C_1...C_m.
6. (permutation, projection and redundancy) from R:A_1...A_m ⊆ S:B_1...B_m derive
R:A_{j_1}...A_{j_p} ⊆ S:B_{j_1}...B_{j_p}, where 1≤j_k≤m, k=1,...,p.
7. (equivalence) from R:AB ⊆ S:CC and σ derive τ, where τ is obtained from σ by
substituting A for one or more occurrences of B.
8. (pullback) from R:A_1...A_nA ⊆ S:B_1...B_nB and S:B_1...B_n→B derive R:A_1...A_n→A.
9. (collection) from R:A_1...A_nB_1...B_m ⊆ S:A'_1...A'_nB'_1...B'_m, R:B_1...B_mC ⊆ S:B'_1...B'_mC' and
S:B'_1...B'_m→C' derive R:A_1...A_nB_1...B_mC ⊆ S:A'_1...A'_nB'_1...B'_mC'.
10. (attribute introduction) from R:A_1...A_n ⊆ S:B_1...B_n and S:B_1...B_n→B derive
R:A_1...A_nN ⊆ S:B_1...B_nB, where N is a new attribute.


Rules 1-3 are the standard rules for FD’s [5, 62] (written in our notation) and Rules 4-6 are the
rules of [16] for IND’s without repeated attributes. The salient rule is attribute introduction (Rule 10).
Whenever this rule is applied, the attribute N is chosen to be an attribute which does not appear in Σ
or in any previous step of the derivation. Rule 10 is sound in the following sense: Whenever the
antecedents are true in relations r,s (over relation schemes R,S respectively), there is a relation r'
which differs from r only on a new column headed by N and which satisfies the conclusion.


2.1.2 Equational Logic 


Let M be a set of symbols and ARITY a function from M to the nonnegative integers N. The set of
finite strings over M is M*. Partition M into two sets:

G={g∈M | ARITY(g)=0} is the set of generators,
O={θ∈M | ARITY(θ)>0} is the set of operators.

Definition 2.1: 𝒯(M), the set of terms over M, is the smallest subset of M* such that,
1) every g in G is a term,
2) if τ_1,...,τ_m are terms and θ is in O with ARITY(θ)=m, then θτ_1...τ_m is a term.

A subterm of τ is a substring of τ, which is also a term. Let V={x,x_1,x_2,...} be a set of variables.
The set of terms over operators O and generators G∪V will be denoted by 𝒯⁺(M). For terms τ_1,...,τ_n
in 𝒯⁺(M) we have a substitution φ={(x_k←τ_k) | k=1,...,n}, which is a function from 𝒯⁺(M) to
𝒯⁺(M). We use φ(τ) or τ[x_1/τ_1,...,x_n/τ_n] for the result of replacing all occurrences of variables x_k in
term τ by term τ_k, k=1,...,n, where these changes are made simultaneously.
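Terms, subterms and simultaneous substitution can be made concrete with a short sketch. The representation is an assumption made for illustration only: a generator or variable is a string, and a term θτ_1...τ_m is a tuple whose first entry is the operator symbol. The function names are hypothetical.

```python
def substitute(term, phi):
    """Apply the substitution phi (dict: variable -> term) simultaneously."""
    if isinstance(term, str):                 # generator or variable
        return phi.get(term, term)
    op, *args = term
    return (op, *(substitute(a, phi) for a in args))


def subterms(term):
    """Yield all subterms of a term, including the term itself."""
    yield term
    if not isinstance(term, str):
        for arg in term[1:]:
            yield from subterms(arg)


def show(term):
    """String (prefix) notation for a term, as in Definition 2.1."""
    if isinstance(term, str):
        return term
    return term[0] + "".join(show(a) for a in term[1:])


t = ("f", ("a1", "x"), ("a2", "x"))                # the term f a1 x a2 x
print(show(t))                                     # fa1xa2x
print(show(substitute(t, {"x": ("i", "x")})))      # fa1ixa2ix

# Simultaneity matters: x and y are exchanged, not collapsed.
print(substitute(("g", "x", "y"), {"x": "y", "y": "x"}))  # ('g', 'y', 'x')
```

Because `substitute` consults `phi` only on the original leaves, replacements are never re-substituted, which is exactly the simultaneity requirement.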


Definition 2.2: A binary relation ≈ on 𝒯(M) or 𝒯⁺(M) is a congruence provided that,
1) ≈ is an equivalence relation,
2) if ARITY(θ)=m and τ_k ≈ τ'_k, k=1,...,m, then θτ_1...τ_m ≈ θτ'_1...τ'_m.

An equation e is a string of the form τ=τ' where τ,τ' are in 𝒯⁺(M). We use the symbol E for a set
of equations. We will be dealing with models for sets of equations, i.e., algebras. We consider each
equation e as a sentence of first-order predicate calculus (with equality), where all the variables from
V are universally quantified.

Definition 2.3: An algebra 𝒜 is a pair (A,F), where A is a nonempty set and F is a set of functions.
Each f in F is a function from A^n to A, for some n in N which we denote as type(f).


Example 2.1:
(a) A semigroup (A,{+}) is an algebra with one binary operator which is associative, i.e., for all x,y,z
in A we have (x+y)+z=x+(y+z). An example of a semigroup is the set of functions from N to N,
together with the composition operation. In semigroups we use ab instead of a+b. We also omit
parentheses, without ambiguity.

(b) 𝒜_M is an algebra with A=𝒯(M). For each θ in O we define a function θ in F with
type(θ)=ARITY(θ): here we use the same symbol for the syntactic object θ and its interpretation.
The function θ maps terms τ_1,...,τ_m from 𝒯(M) to the term θτ_1...τ_m (i.e., θ(τ_1,...,τ_m)=θτ_1...τ_m).
This algebra is referred to as the free algebra on M. From this example it is clear that we can without
ambiguity use both θτ_1...τ_m and θ(τ_1,...,τ_m) to denote the same term.

(c) Let ≈ be a congruence on 𝒯(M). Condition (2) of Definition 2.2 guarantees that the operations
in O are well-defined on ≈-equivalence (or congruence) classes. Thus we can form a quotient
algebra 𝒯(M)/≈ with domain {[τ] | τ in 𝒯(M), [τ] is the ≈-congruence class of τ} and with functions
corresponding to the operators in O.

(d) Observations similar to (b),(c) can be made for the set of terms 𝒯⁺(M).


Implication: Let e be an equation and 𝒜 an algebra. 𝒜 satisfies e, or is a model for e, if e becomes
true when its operators and nonvariable generators are interpreted as the functions of 𝒜 and its
variables take any values in the domain of 𝒜. The class of all algebras which are models for a set of
equations E is called a variety or an equational class. We say that E implies e (E ⊨ e) if the equation e
is true in every model of E.

Definition 2.4: An equational theory is a set of equalities E (of terms over 𝒯⁺(M)), closed under
implication.
See [41] for a survey of equational theories.


We write E ⊢ e, if there exists a finite proof of e starting from E and using only the following five
rules:
τ=τ,
from τ_1=τ_2 deduce τ_2=τ_1,
from τ_1=τ_2 and τ_2=τ_3 deduce τ_1=τ_3,
from τ_k=τ'_k, k=1,...,m, deduce θτ_1...τ_m=θτ'_1...τ'_m (ARITY(θ)=m),
from τ_1=τ_2 deduce φ(τ_1)=φ(τ_2) (φ is any substitution).

Proposition 2.2: [14, 41] E ⊨ τ=τ' iff E ⊢ τ=τ'. ∎


Proofs in the above system can also be viewed as reduction sequences, as follows [41]: Whenever
E ⊢ τ=τ', there is a sequence of terms τ_0,...,τ_m such that τ_0 is τ, τ_m is τ', and for k=0,...,m-1 the
term τ_{k+1} is obtained from τ_k by rewriting a subterm φ(ρ_1) as φ(ρ_2), where ρ_1=ρ_2 (or ρ_2=ρ_1) is an
equation in E and φ is a substitution.

Let Γ be a set of equations over terms in 𝒯(M) (i.e., containing no variables). Consider the
equational theory consisting of all equations τ=τ' such that Γ ⊨ τ=τ'. By Proposition 2.2 this theory
induces a congruence ≈_Γ on 𝒯(M), where τ ≈_Γ τ' iff Γ ⊢ τ=τ'. From example (c) above we see that
this congruence naturally defines an algebra 𝒯(M)/≈_Γ. If Γ is a finite set, 𝒯(M)/≈_Γ is known as a
finitely presented algebra [47].


2.2 Functional and Inclusion Dependencies as Equations 


Let Σ be a set of FD’s and IND’s over a database scheme D and σ an FD or IND. We will
transform Σ into two sets of equations E_Σ and ℰ_Σ. We will show that Σ ⊨ σ iff E_Σ ⊨ E_σ iff ℰ_Σ ⊨ ℰ_σ,
for some sets of equations E_σ, ℰ_σ whose form depends on Σ and σ. We assume that D only contains
one relation scheme. This simplifies notation, and there is no loss of generality.


Transformation: From the dependencies in Σ construct the following sets of symbols:

M_f = {f_k | for each FD with n attribute left-hand side include one operator f_k of ARITY n},
M_i = {i_k | for each IND include one operator i_k of ARITY 1},
M_a = {a_k | for each attribute A_k include one operator a_k of ARITY 1},
M_α = {α_k | for each attribute A_k include one generator α_k}.

Now let M=M_f∪M_i∪M_a∪M_α and V={x,x_1,x_2,...} be a set of variables. 𝒯⁺(M_f) (𝒯⁺(M_i)) are the
sets of terms constructed using operators in M_f (M_i) and generators in V.

The set E_Σ consists of the following equations (presented in string notation):
1) one equation for each FD A_1...A_n→A: f_k a_1x...a_nx = ax,
2) m equations for each IND B_1...B_m ⊆ A_1...A_m: a_1 i_k x = b_1 x and ... and a_m i_k x = b_m x.

The set ℰ_Σ consists of the following equations:
3) one equation for each FD A_1...A_n→A: f_k α_1...α_n = α,
4) m equations for each IND B_1...B_m ⊆ A_1...A_m: i_k α_1 = β_1 and ... and i_k α_m = β_m,
5) for each pair of symbols f_k in M_f and i_l in M_i the equation f_k i_l x_1...i_l x_n = i_l f_k x_1...x_n
(ARITY(f_k)=n).
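The transformation can be sketched as a small generator of equations. This is an illustration only: equations are produced as plain strings, the operator for attribute A is written in lower case, and the generator for A is written `alpha_A`; these naming conventions and the function name are assumptions of the sketch.

```python
def equations(fds, inds):
    """Build the two equation sets of the Transformation, as strings.

    fds:  list of (lhs_attrs, rhs_attr), e.g. (["A"], "B") for A→B
    inds: list of (b_attrs, a_attrs), meaning B_1...B_m ⊆ A_1...A_m
    Returns (E, calE): equations (1),(2) and equations (3),(4),(5).
    """
    op = lambda attr: attr.lower()          # unary operator a(.) for attribute A
    gen = lambda attr: "alpha_" + attr      # generator for attribute A
    E, calE = [], []
    for k, (lhs, rhs) in enumerate(fds, 1):
        E.append(f"f{k} " + " ".join(f"{op(a)} x" for a in lhs)
                 + f" = {op(rhs)} x")                                    # (1)
        calE.append(f"f{k} " + " ".join(gen(a) for a in lhs)
                    + f" = {gen(rhs)}")                                  # (3)
    for l, (bs, as_) in enumerate(inds, 1):
        for b, a in zip(bs, as_):
            E.append(f"{op(a)} i{l} x = {op(b)} x")                      # (2)
            calE.append(f"i{l} {gen(a)} = {gen(b)}")                     # (4)
    for k, (lhs, _) in enumerate(fds, 1):
        xs = [f"x{j}" for j in range(1, len(lhs) + 1)]
        for l in range(1, len(inds) + 1):
            calE.append(f"f{k} " + " ".join(f"i{l} {x}" for x in xs)
                        + f" = i{l} f{k} " + " ".join(xs))               # (5)
    return E, calE


# Σ = {A→B, CD ⊆ AB}
E, calE = equations([(["A"], "B")], [(["C", "D"], ["A", "B"])])
for eq in E + calE:
    print(eq)
```

Note the direction in equations (2) and (4): the operator i is applied on the side of the right-hand attributes A_k and yields the left-hand attributes B_k.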


Note that in ℰ_Σ only equations (5) contain variables. Equations (5) are commutativity conditions
between the f_k’s and the i_l’s. We now present Theorem 2.1, which is central to our analysis.

Theorem 2.1: In each of the following three cases, (i),(ii),(iii) are equivalent.
= Case:
i) Σ ⊨ A=B
ii) E_Σ ⊨ ax=bx
iii) ℰ_Σ ⊨ α=β.

FD Case:
i) Σ ⊨ A_1...A_n→A
ii) E_Σ ⊨ τ[x_1/a_1x,...,x_n/a_nx] = ax, for some τ in 𝒯⁺(M_f)
iii) ℰ_Σ ⊨ τ[x_1/α_1,...,x_n/α_n] = α, for some τ in 𝒯⁺(M_f).

IND Case:
i) Σ ⊨ B_1...B_m ⊆ A_1...A_m
ii) E_Σ ⊨ a_1τ=b_1x and ... and a_mτ=b_mx, for some τ in 𝒯⁺(M_i)
iii) ℰ_Σ ⊨ τ[x/α_1]=β_1 and ... and τ[x/α_m]=β_m, for some τ in 𝒯⁺(M_i).


Proof: Observe that the = Case follows immediately from the IND Case, by writing A=B as
AB ⊆ AA. We use E_σ (ℰ_σ) to denote the set of equations corresponding to term τ in (ii),(iii).

(ii)⇒(i):
Suppose E_Σ ⊨ E_σ, and let relation r satisfy Σ; we will show that r satisfies σ (σ is A_1...A_n→A in the
FD Case and B_1...B_m ⊆ A_1...A_m in the IND Case). Relation r is, by definition, nonempty and its
entries can be assumed w.l.o.g. to be positive integers. Let the tuples of r be t_1,t_2,... (it could contain
a countably infinite number of tuples).
For each attribute A in U, define a function a(.): N→N (N is the set of nonnegative integers) so that, if
ν is the index of a tuple in r, then a(ν) is the entry in tuple t_ν at attribute A; else a(ν) is 0.
For each FD C_1...C_j→C in Σ, define a function f(...): N^j→N so that, if n_k=t_ν[C_k], k=1,...,j, for some
tuple t_ν of r, then f(n_1,...,n_j)=t_ν[C]; else f(n_1,...,n_j) is 0. This is a well-defined function, since r
satisfies C_1...C_j→C.
For each IND D_1...D_j ⊆ C_1...C_j in Σ, define a function i(.): N→N so that, if ν is the index of a tuple in
r, then i(ν)=ν', where ν' is the index of the first tuple in r where t_ν[D_1...D_j]=t_ν'[C_1...C_j]; else i(ν) is 0.
This is also a well-defined function, since r satisfies D_1...D_j ⊆ C_1...C_j.
We have constructed an algebra with domain N and functions a(.),...,f(...),...,i(.),..., which, as is easy to
verify, is a model for E_Σ. Let σ be an IND. By interpreting each symbol in τ as an i(.), we see that,
when ν is a tuple number, τ[x/ν] is another tuple number. Since E_Σ ⊨ E_σ, we must have
a_k(τ[x/ν]) = b_k(ν), k=1,...,m, which means that r satisfies σ. The case of an FD is similar.


(iii)⇒(ii):
Suppose ℰ_Σ ⊨ ℰ_σ, and let ℳ be a model of E_Σ; we will show that ℳ satisfies E_σ. From ℳ we
construct a model Λ(ℳ) for ℰ_Σ. The domain of Λ(ℳ) is the set of all functions from ℳ to ℳ, i.e.
ℳ→ℳ.
In Λ(ℳ) the interpretation of α is the function λx.a(x), where a(x) is the interpretation of a(.) in ℳ. The
interpretation of i(.) is the function λh.λx.h(i(x)), where i(x) is the interpretation of i(.) in ℳ. This is a
function from ℳ→ℳ to ℳ→ℳ. The interpretation of f(...) is the function
λh_1...λh_n.λx.f(h_1(x),...,h_n(x)), where f(x_1,...,x_n) is the interpretation of f(...) in ℳ. This is a function from
(ℳ→ℳ)^n to ℳ→ℳ.
It is straightforward to check that equations (3),(4) hold in Λ(ℳ), because ℳ is a model for E_Σ.
Also equations (5) hold in Λ(ℳ): For example, if n=1 the interpretation of f(i(h)) in Λ(ℳ) is
λx.f(h(i(x))), which is also the interpretation of i(f(h)) (h is any element of ℳ→ℳ). Thus Λ(ℳ) is a
model for ℰ_Σ. Since ℰ_Σ ⊨ ℰ_σ, Λ(ℳ) satisfies ℰ_σ. From this it follows that ℳ satisfies E_σ.
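The two model constructions used in this proof can be checked on a small instance. The sketch below is an illustration under stated assumptions: the relation r, the finite test domain `DOM`, and all function names are hypothetical. It builds the algebra of the (ii)⇒(i) step from a relation satisfying A→B and CD ⊆ AB, lifts it to functions as in the (iii)⇒(ii) step, and checks equations (3), (4) and (5) pointwise.

```python
# A relation r over ABCD (tuple index -> tuple) satisfying A→B and CD ⊆ AB.
rows = {1: {"A": 1, "B": 2, "C": 3, "D": 4},
        2: {"A": 3, "B": 4, "C": 3, "D": 4}}
DOM = range(6)  # finite part of N used for the pointwise checks


def attr(name):
    """Interpretation of the unary operator for an attribute (0 off the tuple indices)."""
    return lambda v: rows[v][name] if v in rows else 0


a, b, c, d = (attr(n) for n in "ABCD")


def f(n):
    """Interpretation of the operator for the FD A→B."""
    return next((t["B"] for t in rows.values() if t["A"] == n), 0)


def i(v):
    """Interpretation of the operator for the IND CD ⊆ AB: first witness tuple."""
    if v not in rows:
        return 0
    t = rows[v]
    return next((w for w in sorted(rows)
                 if (rows[w]["A"], rows[w]["B"]) == (t["C"], t["D"])), 0)


# Lifted interpretations: elements of the new domain are functions on N.
lift_i = lambda h: (lambda x: h(i(x)))      # λh.λx.h(i(x))
lift_f = lambda h: (lambda x: f(h(x)))      # λh.λx.f(h(x)), unary case
same = lambda g, h: all(g(x) == h(x) for x in DOM)

assert same(lift_f(a), b)                               # (3): f α = β
assert same(lift_i(a), c) and same(lift_i(b), d)        # (4): i α = γ, i β = δ
for h in (a, b, c, d, lambda x: (x + 1) % 6):
    assert same(lift_f(lift_i(h)), lift_i(lift_f(h)))   # (5): f i h = i f h
assert same(lift_f(c), d)                               # consequence: f γ = δ, i.e. C→D
print("lifted algebra satisfies (3), (4), (5)")
```

The commutativity check succeeds for any h because both sides unfold to λx.f(h(i(x))), which is the observation made in the proof.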


(i)⇒(iii):
IND Case:
Consider a chase proof of B_1...B_m ⊆ A_1...A_m from Σ. This chase starts from a single tuple t_1 and
generates tuples t_2,...,t_ν, where t_ν[A_1...A_m]=t_1[B_1...B_m]. Now a tuple can only be generated by
applying an IND-rule on some previously generated tuple. Thus, we can assign (inductively) to each
tuple t_p, p=1,...,ν, a term τ_p in 𝒯⁺(M_i), as follows:
1. τ_1=x.
2. If t_p was generated from t_q, q<p, by applying the IND-rule corresponding to some IND i in Σ, then
τ_p=τ_q[x/ix].

The term τ_p records the sequence of applications of IND-rules which produced t_p (starting from t_1).
We will show the following

Claim: For 1≤p,q≤ν, C,D in U, if t_p[C]=t_q[D], then ℰ_Σ ⊨ τ_p[x/γ]=τ_q[x/δ], where γ,δ are the
symbols in M_α corresponding to C,D.

Clearly, the IND Case follows from the Claim: Since t_ν[A_1...A_m]=t_1[B_1...B_m], we have
ℰ_Σ ⊨ τ_ν[x/α_k]=β_k, k=1,...,m.


Proof of Claim: Suppose the equality t_p[C]=t_q[D] appears after exactly z steps of the chase. We
argue by induction on z.
Basis: z=0. Then p=q=1, C is D, and the conclusion is straightforward.

Induction Step: Let t_p[C]=κ, t_q[D]=λ. The symbols κ,λ were equated by the chase. We
distinguish three cases, according to how this happened.

a. κ is a freshly created symbol, identical to λ. This means t_p was created from t_p', p'<p, using an
IND X_1C'X_2 ⊆ Y_1CY_2 in Σ (X_k,Y_k ⊆ U, k=1,2), and t_p'[C']=t_q[D]. By the induction hypothesis
ℰ_Σ ⊨ τ_p'[x/γ']=τ_q[x/δ]. Now τ_p=τ_p'[x/ix], where i is the operator corresponding to
X_1C'X_2 ⊆ Y_1CY_2, and also iγ=γ' is in ℰ_Σ. Thus, ℰ_Σ ⊨ τ_p'[x/iγ]=τ_q[x/δ], i.e. ℰ_Σ ⊨ τ_p[x/γ]=τ_q[x/δ].

b. κ was equated to λ in order to satisfy some FD C_1...C_j→C in Σ. This means
t_p[C_1...C_j]=t_q[C_1...C_j], and D is C. By the induction hypothesis ℰ_Σ ⊨ τ_p[x/γ_k]=τ_q[x/γ_k], k=1,...,j.
Also, we have in ℰ_Σ the equation fγ_1...γ_j=γ, where f is the operator in M_f corresponding to the FD
C_1...C_j→C. Thus, ℰ_Σ implies fτ_p[x/γ_1]...τ_p[x/γ_j]=τ_p[x/fγ_1...γ_j] (by the commutativity conditions
(5)) =τ_p[x/γ]. Similarly ℰ_Σ implies fτ_q[x/γ_1]...τ_q[x/γ_j]=τ_q[x/fγ_1...γ_j]=τ_q[x/γ], so
ℰ_Σ ⊨ τ_p[x/γ]=τ_q[x/γ].

c. There are tuples t_p', t_q', p'≤p, q'≤q, and C',D' in U such that t_p'[C']=κ, t_q'[D']=λ, and t_p'[C']
was equated to t_q'[D'] at some earlier step. Then by the induction hypothesis ℰ_Σ implies
τ_p[x/γ]=τ_p'[x/γ'], τ_q[x/δ]=τ_q'[x/δ'], and τ_p'[x/γ']=τ_q'[x/δ']. Thus, ℰ_Σ ⊨ τ_p[x/γ]=τ_q[x/δ].


FD Case:
Consider, as before, a chase proof of A_1...A_n→A from Σ. This chase starts from two tuples t_1,t_2 and
generates tuples t_3,...,t_ν; finally, t_1[A]=t_2[A]. Again a tuple can only be generated by applying an
IND-rule on some previously generated tuple, so we can assign (inductively) to each tuple t_p,
p=1,...,ν, a term τ_p in 𝒯⁺(M_i), as follows:
1. τ_1=x_1, τ_2=x_2.
2. If t_p was generated from t_q, q<p, by applying the IND-rule corresponding to some IND i in Σ, then
τ_p=τ_q[x_1/ix_1, x_2/ix_2].

Observe that τ_p also records the tuple (t_1 or t_2) which produced t_p (apart from the sequence of
applications of IND-rules).
We will show the following

Claim: For 1≤p,q≤ν, C,D in U, if t_p[C]=t_q[D], then ℰ_Σ ⊨ τ_p[x_k/γ]=τ_q[x_k/δ] (k=1,2). If,
additionally, t_p is produced from t_1 and t_q is produced from t_2, then ℰ_Σ implies
τ_p[x_1/γ]=τ_q[x_2/δ]=τ[x_1/α_1,...,x_n/α_n], for some τ in 𝒯⁺(M_f).

Clearly, the FD Case follows from the second part of the Claim: Since t_1[A]=t_2[A],
ℰ_Σ ⊨ α=τ[x_1/α_1,...,x_n/α_n], for some τ in 𝒯⁺(M_f).


Proof of Claim: Suppose the equality t_p[C]=t_q[D] appears after exactly z steps of the chase. We
argue by induction on z.

Basis: z=0. Then p=q=1, C and D are both some A_k, 1≤k≤n, and the conclusion is
straightforward.

Induction Step: Let t_p[C]=κ, t_q[D]=λ. The symbols κ,λ were equated by the chase. We
distinguish three cases, according to how this happened.

a. κ is a freshly created symbol, identical to λ. This means t_p was created from t_p', p'<p, using an
IND X_1C'X_2 ⊆ Y_1CY_2 in Σ (X_k,Y_k ⊆ U, k=1,2), and t_p'[C']=t_q[D]. For the first part of the Claim,
we argue exactly as in the IND Case. For the second part, note that if t_p is produced from t_1 then so is
t_p'. Therefore we can use the induction hypothesis on t_p',t_q.

b. κ was equated to λ in order to satisfy some FD C_1...C_j→C in Σ. This means
t_p[C_1...C_j]=t_q[C_1...C_j], and D is C. The argument for the first part proceeds exactly as in the IND
Case. For the second part, note that since ℰ_Σ implies τ_p[x_1/γ_k]=τ'_k[x_1/α_1,...,x_n/α_n], k=1,...,j,
for some terms τ'_k in 𝒯⁺(M_f) (by the induction hypothesis), we have that ℰ_Σ implies
τ_p[x_1/γ] = τ_p[x_1/fγ_1...γ_j] = fτ_p[x_1/γ_1]...τ_p[x_1/γ_j] = fτ'_1[x_1/α_1,...,x_n/α_n]...τ'_j[x_1/α_1,...,x_n/α_n]
= τ[x_1/α_1,...,x_n/α_n], where τ is fτ'_1...τ'_j. Similarly, ℰ_Σ implies
τ_q[x_2/γ] = τ[x_1/α_1,...,x_n/α_n].

c. There are tuples t_p', t_q', p'≤p, q'≤q, and C',D' in U such that t_p'[C']=κ, t_q'[D']=λ, and t_p'[C']
was equated to t_q'[D'] at some earlier step. The argument for the first part proceeds exactly as in the
IND Case. For the second part, if t_p' was produced from t_2, use the induction hypothesis on t_p,t_p';
else, if t_q' was produced from t_1, use the induction hypothesis on t_q,t_q'; else, use the induction
hypothesis on t_p',t_q'.

This concludes the proof of (i)⇒(iii), so we are done. ∎


We remark here that the (i)⇒(iii) direction can also be proved by showing that each of the rules
of [54] (see Subsection 2.1.1) can be simulated using the equational reasoning of Proposition 2.2. We
illustrate this simulation with an example:
From A→B and CD ⊆ AB the pullback rule of [54] derives C→D. In equational language fα=β,
iα=γ, iβ=δ and fix=ifx imply fγ=fiα=ifα=iβ=δ.

Corollary 2.1: Let Σ be a set of FD’s and σ an FD. The implication problem Σ ⊨ σ is equivalent
to a generator problem for a finitely presented algebra [47].

Proof: ℰ_Σ is now a finite set of equations with no variables. If ≈ is the congruence induced by ℰ_Σ
on 𝒯(M) then 𝒯(M)/≈ is a finitely presented algebra. The equational implication in Theorem 2.1 is
known, in this case, as a generator problem for the finitely presented algebra 𝒯(M)/≈. ∎


Using Corollary 2.1, one can observe that the linear time algorithm of [6] for implication of FD’s 


can be derived in a straightforward way from the algorithm of [47] for the generator problem. 
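For orientation, FD implication for a set of FD’s alone can be decided by computing an attribute closure with one counter per FD. The sketch below is the standard textbook counter-based closure, given here as an illustration; it is not a transcription of the algorithm of [6] or of [47], and the function names are hypothetical.

```python
from collections import defaultdict


def closure(attrs, fds):
    """Attributes determined by attrs under fds (list of (lhs_attrs, rhs_attr)).

    Each FD keeps a counter of left-hand-side attributes not yet derived;
    the FD fires when its counter reaches zero.
    """
    remaining = [len(set(lhs)) for lhs, _ in fds]
    uses = defaultdict(list)                 # attribute -> FDs with it on the left
    for k, (lhs, _) in enumerate(fds):
        for a in set(lhs):
            uses[a].append(k)

    closed = set(attrs)
    queue = list(closed)
    for k, (_, rhs) in enumerate(fds):       # FD's with an empty left-hand side
        if remaining[k] == 0 and rhs not in closed:
            closed.add(rhs)
            queue.append(rhs)

    while queue:
        a = queue.pop()                      # each attribute is processed once
        for k in uses[a]:
            remaining[k] -= 1
            if remaining[k] == 0 and fds[k][1] not in closed:
                closed.add(fds[k][1])
                queue.append(fds[k][1])
    return closed


def implies(fds, lhs, rhs):
    """Decide whether fds imply the FD lhs→rhs."""
    return rhs in closure(lhs, fds)


fds = [(["A"], "B"), (["B", "C"], "D")]
print(sorted(closure(["A", "C"], fds)))   # ['A', 'B', 'C', 'D']
print(implies(fds, ["A"], "D"))           # False
```

Every attribute enters the queue at most once and every FD counter is decremented at most once per left-hand-side attribute, so the running time is linear in the total size of the FD’s.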


Corollary 2.2: Let Σ be a set of FD’s. The implication problem Σ ⊨ A=B is a uniform word
problem for a finitely presented algebra [47]. ∎

If the given FD’s are all unary, then the equational inferences in the theory E_Σ can be thought of
as inferences in semigroups. This gives yet another transformation of (unary) FD’s and IND’s into
equations:


Semigroup Transformation: Let Σ be a set of IND’s and u-FD’s. Construct a set of symbols M_S
from M as follows: for each f_k(.) in M_f add one generator f_k in M_S; for each i_k(.) in M_i add one
generator i_k in M_S; for each a_k(.) in M_a add one generator a_k in M_S; add one binary operator + in M_S.

The set of equations E_S consists of the associative axiom for + and the following word (string)
equations (we omit + and parentheses):
1) one equation for each u-FD A_l→A: f_k a_l = a,
2) m equations for each IND B_1...B_m ⊆ A_1...A_m: a_1 i_k = b_1 and ... and a_m i_k = b_m.

Corollary 2.3: Let Σ be a set of u-FD’s and IND’s:
Σ ⊨ A=B iff E_S ⊨ a=b.
Σ ⊨ A_1→A iff E_S ⊨ wa_1=a, for some string w in M_f*.
Σ ⊨ B_1...B_m ⊆ A_1...A_m iff E_S ⊨ a_1w=b_1 and ... and a_mw=b_m, for some string w in M_i*.

Note that the first case is an instance of the uniform word problem for semigroups. The other two
cases are known as E_S-unification problems [41].


2.3 A Proof Procedure for FD’s and IND’s 


We will now describe a proof procedure for FD and IND implication, which exploits the special
structure of the equational theory ℰ_Σ (Theorem 2.1). Whenever a dependency σ cannot be proved
from a set of dependencies Σ, the procedure provides us (in a natural way) with an algebra which
satisfies ℰ_Σ but violates any equation that could correspond to σ. Thus, by Theorem 2.1 we have that
Σ does not imply σ, i.e. the procedure is complete for FD and IND implication.


The Proof Procedure G:

Given a set Σ of FD’s and IND’s construct their graphical representation G_Σ, defined in Subsection
2.1.1. Each attribute name in Σ is associated with one of the nodes of G_Σ.

Rules: Apply some finite sequence of the graph manipulation rules 1,2,3 and 4 of Figure 2-2 on G_Σ.
Rules 1 and 2 introduce new unnamed nodes. Rules 3 and 4 identify two existing nodes; the node
resulting from this identification is associated with the union of the two sets of attribute names that
were associated with each of the identified nodes. Note that rules 1,2 w.l.o.g. need be applied at most
once to every left-hand side configuration.

Let G be the resulting graph. Associate a unique new name with every unnamed node in G.
We say that Σ ⊢_G σ when:

σ is A=B: A,B are associated with the same node.
σ is an FD A_1...A_n→A: The node associated with A gets marked by the following algorithm: We
mark the nodes associated with A_1,...,A_n; whenever nodes v_1,...,v_j are marked and there is a group of
red arcs (v_1,v),...,(v_j,v) labeled by the name f of some FD in Σ, we mark v.
σ is an IND B_1...B_m ⊆ A_1...A_m: For k=1,...,m there is a black directed path from A_k to B_k; moreover,
all these paths have the same sequence of labels.

Note that, as expected, the A=B Case is a specialization of the IND Case: if Σ ⊢_G AB ⊆ AA, then
A,B can be identified using Rule 3.


Theorem 2.2: Σ ⊨ σ iff Σ ⊢_G σ.

Proof:
(⇐): Rules 3,4 are obviously sound. Rules 1 and 2 are sound in the sense of the attribute introduction
rule of [54] (see Subsection 2.1.1), which we illustrate as rule 5 of Figure 2-2.

(⇒): Let G be a (possibly infinite) graph obtained by closing G_Σ under Rules 1-4. We will
construct from G a model ℳ of ℰ_Σ.
The domain M of ℳ is the set V of nodes of G, together with a special node ⊥. The generator α_k is
interpreted as the node associated with A_k.
An operator i in ℰ_Σ (corresponding to some IND in Σ) is interpreted as a function i: M→M as
follows: if v is in V and has an outgoing arc (v,w) labeled i, then i(v)=w; else i(v)=⊥. This function
is well-defined, because G is closed with respect to Rule 3.
An operator f of ARITY j in ℰ_Σ (corresponding to some FD in Σ) is interpreted as a function f: M^j→M
as follows: if v_1,...,v_j are in V and there is a group of red arcs (v_1,v),...,(v_j,v) labeled f, then
f(v_1,...,v_j)=v; else f(v_1,...,v_j)=⊥. This function is well-defined, because G is closed with respect to
Rule 4.
One can check that ℳ satisfies the commutativity conditions (5) of ℰ_Σ (because G is closed with
respect to Rules 1,2) and ℳ satisfies equations (3),(4) of ℰ_Σ (because G was constructed starting from
G_Σ). Thus, ℳ is a model of ℰ_Σ.
Now suppose we cannot prove σ from Σ. If σ is an FD A_1...A_n→A, then clearly there is no τ in
𝒯⁺(M_f) such that τ[x_1/α_1,...,x_n/α_n]=α in ℳ. Thus, ℳ is a counterexample to condition (iii) of
Theorem 2.1 and therefore Σ does not imply σ. Similarly if σ is an IND. ∎


2.4 Computations as Inferences 


It has been known, since at least Post’s proof of the unsolvability of the word problem for Thue 
systems [55, 50], that arbitrary computations can be simulated by inferences in semigroups. Using 
Corollary 2.3, we show that we can simulate computations by inferences of IND’s and unary FD’s. 


We thus obtain lower bounds on the complexity of the implication problem for IND’s and FD’s. 
We first describe our machine model: A deterministic two-stack machine M is a 5-tuple
(Q,Π,q_start,h,δ), where Q is a finite set of states, Π is a finite set of symbols (Q∩Π=∅), q_start∈Q is
the start state, h∈Q is the halt state, and δ is the transition function. Each move of M falls into one of
the following two types:

1. δ(q,α)=(p,POP_1): This means that, if M is in state q and α∈Π is the top symbol of
STACK_1, then on the next step M goes to state p and pops STACK_1.

2. δ(q)=(p,PUSH_1(β)): If M is in state q, then on the next step M goes to state p and pushes
β∈Π on STACK_1.
Of course, analogous instructions can manipulate STACK_2.
An instantaneous description (ID) of M is a string x_1...x_n q y_m...y_1, where q∈Q, x_i,y_j∈Π: the string
x_1...x_n is the contents of STACK_1 (the top symbol is x_n); the string y_m...y_1 is the contents of STACK_2
(the top symbol is y_m). The relation w_1 ⇒_M w_2 (ID w_1 yields ID w_2 via one step of M) is defined in
the standard way [50, 40]. ⇒*_M is the reflexive, transitive closure of ⇒_M.
Let us now define a set S of word equations (over generators Q∪Π) which capture the
computation of M:

1. If δ(q,α)=(p,POP_1), then αq=p is in S.
If δ(q,α)=(p,POP_2), then qα=p is in S.

2. If δ(q)=(p,PUSH_1(β)), then q=βp is in S.
If δ(q)=(p,PUSH_2(β)), then q=pβ is in S.
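The construction of S from the moves of M, and the way left-to-right rewriting with S follows a computation, can be sketched as follows. The encoding of moves as tuples, the sample machine, and all function names are assumptions made for this illustration.

```python
def word_equations(moves):
    """Build the word equations S from the moves of a two-stack machine.

    moves: list of (kind, stack, q, symbol, p) with kind "pop" or "push",
           standing for δ(q,α)=(p,POP_stack) or δ(q)=(p,PUSH_stack(β)).
    Returns equations as pairs (lhs, rhs) of tuples of symbols.
    """
    S = []
    for kind, stack, q, sym, p in moves:
        if kind == "pop":      # αq=p for STACK_1, qα=p for STACK_2
            lhs = (sym, q) if stack == 1 else (q, sym)
            S.append((lhs, (p,)))
        elif kind == "push":   # q=βp for STACK_1, q=pβ for STACK_2
            rhs = (sym, p) if stack == 1 else (p, sym)
            S.append(((q,), rhs))
        else:
            raise ValueError(kind)
    return S


def rewrite_once(word, S):
    """Rewrite the first occurrence of some left-hand side, or return None."""
    for lhs, rhs in S:
        n = len(lhs)
        for k in range(len(word) - n + 1):
            if word[k:k + n] == lhs:
                return word[:k] + rhs + word[k + n:]
    return None


def run(word, S, halt, max_steps=100):
    """Follow the computation as a sequence of ID's (words over Q∪Π)."""
    trace = [word]
    while word != (halt,) and len(trace) <= max_steps:
        nxt = rewrite_once(word, S)
        if nxt is None:
            break
        word = nxt
        trace.append(word)
    return trace


# Hypothetical machine: push a on STACK_1, push b on STACK_2, pop both, halt.
moves = [("push", 1, "q0", "a", "q1"),
         ("push", 2, "q1", "b", "q2"),
         ("pop", 2, "q2", "b", "q3"),
         ("pop", 1, "q3", "a", "h")]
S = word_equations(moves)
for ident in run(("q0",), S, "h"):
    print(" ".join(ident))
```

The sketch only applies equations from left to right, which corresponds to running M forward; equational reasoning in S may use them in both directions.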


We write u =_S v iff S ⊨ u=v. By a standard argument, based on the fact that M is deterministic
[55, 50], we have
Lemma 2.1: q_start ⇒*_M h iff q_start =_S h.

To prove our first lower bound, we transform S into another set of equations T which looks like
the sets obtained (as in Corollary 2.3) from IND’s and u-FD’s. The set of generators is now
Q∪{A_α,B_α,f_α | α∈Π}∪{i_α | α∈Π}∪{j_e | e∈S}.

1. If qα=p is in S, then q i_α = p is in T.

2. If αq=p is in S, then T contains the equations q=A_α j_e, f_α A_α = B_α, B_α j_e = p, where e is
αq=p.
Lemma 2.2: q_start =_S h iff q_start =_T h.

Proof: Given a word w over Q∪Π of the form α_1...α_n q β_m...β_1, q∈Q, α_i,β_j∈Π, define a
corresponding word w' to be f_{α_1}...f_{α_n} q i_{β_m}...i_{β_1}. We claim that, if w_1,w_2 are words over Q∪Π, then
w_1 =_S w_2 iff w'_1 =_T w'_2. The Lemma follows from this claim.

To prove the “only if” direction of the claim, consider the equations in S that can be used to
rewrite w_1 as w_2. If qα=p is in S, then q i_α =_T p, since q i_α = p is in T. If αq=p is in S, then f_α q =_T p,
since f_α q =_T f_α A_α j_e =_T B_α j_e =_T p. The converse is also straightforward. ∎
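The passage from S to T and the map w ↦ w' can be sketched in the same style. This is an illustration under stated assumptions: only equations of the two forms qα=p and αq=p are handled (other equations of S would first have to be oriented into one of these forms), the symbol names `f_a`, `A_a`, `B_a`, `i_a`, `j_0` are ad hoc spellings of f_α, A_α, B_α, i_α, j_e, and the function names are hypothetical.

```python
def to_T(S, states):
    """Transform word equations S (forms qα=p and αq=p only) into T."""
    T = []

    def add(eq):
        if eq not in T:          # f_α A_α = B_α is shared by all equations with the same α
            T.append(eq)

    for e, (lhs, rhs) in enumerate(S):
        (p,) = rhs
        x, y = lhs
        if x in states:          # qα=p  becomes  q i_α = p
            add(((x, f"i_{y}"), (p,)))
        else:                    # αq=p  becomes  q = A_α j_e, f_α A_α = B_α, B_α j_e = p
            add(((y,), (f"A_{x}", f"j_{e}")))
            add(((f"f_{x}", f"A_{x}"), (f"B_{x}",)))
            add(((f"B_{x}", f"j_{e}"), (p,)))
    return T


def prime(word, states):
    """w = α_1...α_n q β_m...β_1  ->  w' = f_α1...f_αn q i_βm...i_β1."""
    k = next(idx for idx, s in enumerate(word) if s in states)
    return (tuple(f"f_{s}" for s in word[:k]) + (word[k],)
            + tuple(f"i_{s}" for s in word[k + 1:]))


def rewrite_once(word, eqs):
    """Rewrite the first occurrence of some left-hand side, or return None."""
    for lhs, rhs in eqs:
        n = len(lhs)
        for k in range(len(word) - n + 1):
            if word[k:k + n] == lhs:
                return word[:k] + rhs + word[k + n:]
    return None


states = {"q", "p", "r"}
S = [(("a", "q"), ("p",)),      # αq=p with α=a
     (("q", "b"), ("r",))]      # qα=r with α=b
T = to_T(S, states)

# The chain f_a q = f_a A_a j_0 = B_a j_0 = p from the proof of Lemma 2.2.
w = prime(("a", "q"), states)
while w is not None:
    print(" ".join(w))
    w = rewrite_once(w, T)
```

The printed chain reproduces the three rewriting steps used in the proof for an equation of the form αq=p.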
Theorem 2.3: The implication problem for IND’s and two u-FD’s is undecidable.

Proof: Given a deterministic two-stack machine M, it is undecidable if q_start ⇒*_M h, even if |Π|=2
[53, 40]. By Lemmas 2.1 and 2.2, q_start ⇒*_M h iff q_start =_T h. By Corollary 2.3, q_start =_T h iff
Σ ⊨ Q_start=H, where Σ is the set of IND’s and FD’s which gives rise to T. But now observe that Σ
only contains FD’s of the form A_α→B_α, α∈Π. Since |Π|=2, Σ only contains two unary FD’s.

Undecidability of the implication problem for IND’s and FD’s has already been proved [54, 19].
By way of comparison, these reductions use arbitrarily many IND’s of the form D_1D_2 ⊆ C_1C_2 and
arbitrarily many u-FD’s, while our reduction uses arbitrarily many IND’s and only two u-FD’s.


To prove our second lower bound, we consider computations of a deterministic two-stack machine
M where one of the two stacks has bounded size. Let us write w_1 ⇒^s_M w_2 iff ID w_2 follows from ID w_1
by a computation of M during which STACK_2 contains at most s symbols.

Let S be the set of word equations described before: this time we transform S into a set T^s of
equations which can be obtained (as in Corollary 2.3) from acyclic IND’s and u-FD’s. The set of
generators now is Q^0∪...∪Q^s∪{A_α,B_α,f_α | α∈Π}∪{i_{α,k} | α∈Π, k=1,...,s}∪
∪{j_{e,k} | e∈S, k=0,...,s}, where Q^k={q^k | q∈Q}, k=0,...,s.

1. If qα=p is in S, then q^{k+1} i_{α,k+1} = p^k is in T^s, k=0,...,s-1.

2. If αq=p is in S, then T^s contains the equations q^k=A_α j_{e,k}, f_α A_α = B_α, B_α j_{e,k} = p^k,
k=0,...,s, where e is αq=p.

It is not hard to see that T^s can be taken to represent a set Σ^s of acyclic IND’s and u-FD’s: the
relation names are R[A_α,B_α | α∈Π], R^k[Q^k], k=0,...,s. It is also easy to see the following
Lemma 2.3: q_start ⇒^s_M h iff q^0_start =_{T^s} h^0 iff Σ^s ⊨ R^0:Q^0_start=H^0. ∎


Theorem 2.4: There are constants c_1,c_2>0 such that the implication problem for acyclic IND’s and
FD’s can be solved in time c_1^n but not in time c_2^{√(n/log n)}.

Proof: Since the IND’s are acyclic, the chase gives us a decision procedure, running in exponential
time.
To prove the lower bound, let L be any language in DTIME(c^n), c>0. We will show that L is
polynomial-time reducible to the implication problem for acyclic IND’s and u-FD’s.
Let M be a deterministic n-AuxiliaryPushdownAutomaton accepting L [40]. Given string x, we
construct a deterministic two-stack machine M_x which first puts x on STACK_2 and then simulates
M. This simulation is done as follows: if M is in state q, its auxiliary storage contains α_1...α_k α w (α is
the symbol scanned) and its stack contains uβ (β is the top symbol), then the ID of M_x is
uβα_1...α_k q α w. It is not hard to see how M_x can simulate a move of M. Thus, M accepts x iff M_x
halts and STACK_2 always contains at most |x| symbols, i.e. x∈L iff q_start ⇒^{|x|}_{M_x} h. Note also that the size
of M_x, |M_x|, is O(|x|).
Now let Σ^{|x|} be the set of acyclic IND’s and u-FD’s corresponding to M_x. Using Lemma 2.3, x∈L iff
Σ^{|x|} ⊨ R^0:Q^0_start=H^0. To complete the proof, observe that Σ^{|x|} can be computed from x in
polynomial time, and that the size of Σ^{|x|} is O(|M_x| |x| log|x|), i.e. O(|x|² log|x|). ∎
Figure 2-1: Graph notation for FD’s and IND’s

Figure 2-2: Graph rules for FD’s and IND’s (Rules 1-4 and Rule 5 [Mitchell]; circled nodes are new nodes)


Chapter Three 


Application to Typed IND’s 


In this Chapter we use the tools developed in Chapter 2 (Section 2.2) to study the particular implication problem for FD’s and typed IND’s. We first present a proof procedure for general FD and IND implication (Section 3.1), similar in spirit to the proof procedure of Theorem 2.2. By specializing this proof procedure to typed IND’s, we obtain as a corollary that the implication problem for acyclic FD’s and typed IND’s is decidable (Section 3.2). In Section 3.3 we study the special case of inferring FD’s under pairwise consistency. By analyzing derivations (in the proof procedure of Section 3.1), we show that the problem is undecidable. We also prove that there is no k-ary axiomatization for implication of FD’s under pairwise consistency. As a by-product of our techniques, we obtain finite controllability of acyclic unary FD’s under pairwise consistency.


3.1 Another Proof Procedure for FD’s and IND’s 


We present in this Section a proof procedure for general FD and IND implication. This procedure is the main tool we use to study the implication problem for typed IND’s and FD’s. To prove completeness of the procedure, we show that it captures (in an indirect way) equational inferences in the theory E_Σ of Theorem 2.1.


Let Σ be a given set of FD’s and IND’s over a database scheme D, containing a single relation scheme R[U]. We represent attribute A_k∈U by a node a_k. An FD A_1...A_n→A in Σ is represented as shown in Figure 3-1 by introducing a node fa_1...a_n (we use a different function symbol f for each given FD), a group of directed arcs (a_1, fa_1...a_n),...,(a_n, fa_1...a_n) labeled f and ordered from 1 to n, and an undirected arc <fa_1...a_n, a>. The undirected arc is the only modification to our graph notation of Section 2.1.1. Its purpose is to represent the equation fa_1x...a_nx = ax.

An IND B_1...B_m⊆A_1...A_m in Σ is represented (see Figure 3-1) by introducing directed arcs (a_1,b_1),...,(a_m,b_m), labeled i (we use a different label for each given IND).

Let H_Σ be the mixed graph obtained from Σ as described above. Repeatedly apply Rules T (transitivity), E_{1-2} (equality), I_{1-3} (introduction) (see Figure 3-2) on H_Σ, in some arbitrary fixed order, until no more rules are applicable. As was the case with Rules 1,2 in Theorem 2.2, the introduction rules need only be applied once for each left-hand side configuration.


Let H=(N_H,A_H,E_H) be the mixed graph obtained this way (N_H is a set of nodes, A_H is a set of labeled directed arcs on N_H, and E_H is a set of undirected arcs on N_H). Notice that each node of H is labeled Fu_1...u_q, where F is a term over the function symbols and u_1,...,u_q are nodes representing attributes (by a slight abuse of notation, we write Fu_1...u_q as a shorthand for F[x_1/u_1,...,x_q/u_q]). Moreover, every subterm of Fu_1...u_q appears as a node of H.

By a path labeled τ, where τ is a term over the i’s (and a variable x), we mean a mixed path where the sequence of labels corresponds to τ (see Figure 3-1). In the special case where τ is simply x, the path consists of undirected arcs.

The graph H fully captures implication of FD’s and IND’s from Σ, as we now show:


Theorem 3.1:

FD Case:
Σ ⊨ A_1...A_n→A iff there is a node Fa_1...a_n of H such that <Fa_1...a_n, a>∈E_H.

IND Case:
Σ ⊨ B_1...B_m⊆A_1...A_m iff there is a path from a_k to b_k labeled τ, k=1,...,m, where τ is a term over the i’s.

Proof: Let E_Σ be the set of equations of Theorem 2.1. Assume that the various names in E_Σ are consistent with the names in H.


(⇐):
Claim:
(i) If <Fu_1...u_p, Gv_1...v_q>∈E_H, where the u_k’s, v_k’s are nodes corresponding to attributes and F,G are terms over the f’s, then E_Σ ⊨ Fu_1x...u_px = Gv_1x...v_qx.

(ii) If (Fu_1...u_p, Gv_1...v_q) is a directed arc labeled i, then E_Σ ⊨ Fu_1ix...u_pix = Gv_1x...v_qx.

Clearly, the "if" direction follows from the Claim, by Theorem 2.1.

Proof of Claim: We prove both (i) and (ii) by simultaneous induction on the number of applications of rules that created an (undirected) arc of H.

Basis: No rules were applied. The conclusion is straightforward.

Induction Step: We have to check Rules T, E_{1-2}, I_{1-3}, each of which might have been applied at the last step.
Rules T, E_1: Straightforward.

Rule E_2: The undirected arc <Fu_1...u_p, Gv_1...v_q> was created from the undirected arc <F'u'_1...u'_{p'}, G'v'_1...v'_{q'}>, where (F'u'_1...u'_{p'}, Fu_1...u_p), (G'v'_1...v'_{q'}, Gv_1...v_q) are directed arcs labeled i. By the induction hypothesis, E_Σ implies F'u'_1x...u'_{p'}x = G'v'_1x...v'_{q'}x, F'u'_1ix...u'_{p'}ix = Fu_1x...u_px, G'v'_1ix...v'_{q'}ix = Gv_1x...v_qx. Thus, E_Σ implies Fu_1x...u_px = Gv_1x...v_qx.

Rule I_1: The undirected arcs <F_1u_1...u_p, G_1v_1...v_q>,...,<F_nu_1...u_p, G_nv_1...v_q> create the undirected arc <Fu_1...u_p, Gv_1...v_q>, where F=fF_1...F_n, G=fG_1...G_n. By the induction hypothesis, E_Σ implies F_ku_1x...u_px = G_kv_1x...v_qx, k=1,...,n. Thus, E_Σ implies

Fu_1x...u_px = fF_1u_1x...u_px ... F_nu_1x...u_px = fG_1v_1x...v_qx ... G_nv_1x...v_qx = Gv_1x...v_qx.

Rule I_2: The directed arcs (F_1u_1...u_p, G_1v_1...v_q),...,(F_nu_1...u_p, G_nv_1...v_q) (labeled i) create the directed arc (Fu_1...u_p, Gv_1...v_q) (labeled i), where F=fF_1...F_n, G=fG_1...G_n. By the induction hypothesis, E_Σ implies F_ku_1ix...u_pix = G_kv_1x...v_qx, k=1,...,n. Thus, E_Σ implies

Fu_1ix...u_pix = fF_1u_1ix...u_pix ... F_nu_1ix...u_pix = fG_1v_1x...v_qx ... G_nv_1x...v_qx = Gv_1x...v_qx.

Rule I_3: Identical to Rule I_2.


(⇒): Let u be a node of H labeled Fu_1...u_p, where the u_k’s are nodes corresponding to attributes. We denote by uτ the term Fu_1τ...u_pτ.

Claim: Suppose E_Σ implies Fu_1τ...u_pτ = Gv_1ρ...v_qρ, where the u_k’s, v_k’s correspond to arbitrary nodes of H, F,G are terms over the f’s, and τ,ρ are terms over the i’s (and a variable x). Also assume Fu_1...u_p is a node of H, and there are nodes w_k, k=1,...,p, such that there is a path from u_k to w_k labeled τ. Then Gv_1...v_q is a node of H and there is a path from Gv_1...v_q to Fw_1...w_p labeled ρ.

The "only if" direction follows easily from the Claim, by Theorem 2.1.


Proof of Claim: If E_Σ ⊨ σ=σ', then there is a sequence of terms σ_0,...,σ_m such that σ_0 is σ, σ_m is σ', and for k=0,...,m−1 the term σ_{k+1} is obtained from σ_k by rewriting a subterm φ(θ_1) as φ(θ_2), where θ_1=θ_2 (θ_2=θ_1) is an equation in E_Σ and φ is a substitution (Proposition 2.2). We call such a sequence a proof of the equation σ=σ'.

We define a relation ≺ on pairs of terms as follows:

(ζ,ζ') ≺ (η,η') iff E_Σ implies ζ=ζ' and η=η', and either
(i) the shortest proof of ζ=ζ' is shorter than the shortest proof of η=η', or
(ii) the above proofs have the same length, and ζ is a proper subterm of η, ζ' is a proper subterm of η'.

Obviously, ≺ is well-founded, so we can argue by induction on ≺. Let σ_0,...,σ_m be a shortest proof of the equation Fu_1τ...u_pτ = Gv_1ρ...v_qρ.


Basis: m=0. Using I_2, I_3 we see by an easy induction on the structure of F that there is a node Fw_1...w_p and a path from Fu_1...u_p to Fw_1...w_p labeled τ (see Figure 3-3).

Induction Step: We assume that the Claim holds for all equations ζ=ζ' implied by E_Σ, where (ζ,ζ') ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ); we will show that it holds for the equation Fu_1τ...u_pτ = Gv_1ρ...v_qρ.

We distinguish two cases:

Case 1: For k=0,...,m−1, σ_{k+1} is obtained from σ_k by rewriting a proper subterm. This means F is fF_1...F_n, G is fG_1...G_n, and F_su_1τ...u_pτ is rewritten as G_sv_1ρ...v_qρ, s=1,...,n. Now for s=1,...,n, F_su_1...u_p is a node of H and (F_su_1τ...u_pτ, G_sv_1ρ...v_qρ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis G_sv_1...v_q is a node of H and there is a path from G_sv_1...v_q to F_sw_1...w_p labeled ρ (see Figure 3-4). Now by Rules I_2, I_3 and an easy induction on the structure of F_s, there is a path from F_su_1...u_p to F_sw_1...w_p labeled τ; then by Rules I_1, I_3 there is a node fF_1w_1...w_p ... F_nw_1...w_p, i.e. a node labeled Fw_1...w_p. It follows by Rules I_1, I_3 that there is a node fG_1v_1...v_q ... G_nv_1...v_q, i.e. a node Gv_1...v_q, and that there is a path from Gv_1...v_q to Fw_1...w_p labeled ρ.


Case 2: For some k, 0≤k≤m−1, σ_k is rewritten into σ_{k+1}. We distinguish four subcases:

Case 2a: Fu_1τ...u_pτ is rewritten as fa_1ξ...a_nξ, then as aξ using an equation fa_1x...a_nx=ax in E_Σ and then as Gv_1ρ...v_qρ. Clearly (Fu_1τ...u_pτ, fa_1ξ...a_nξ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis there is a path from fa_1...a_n to Fw_1...w_p labeled ξ (see Figure 3-5). Since <fa_1...a_n, a>∈E_H, there is a path from a to Fw_1...w_p labeled ξ. We also have (aξ, Gv_1ρ...v_qρ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis Gv_1...v_q is a node of H and there is a path from Gv_1...v_q to Fw_1...w_p labeled ρ.

Case 2b: Fu_1τ...u_pτ is rewritten as aξ, then as fa_1ξ...a_nξ using an equation fa_1x...a_nx=ax in E_Σ and then as Gv_1ρ...v_qρ. Clearly (Fu_1τ...u_pτ, aξ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis there is a path from a to Fw_1...w_p labeled ξ (see Figure 3-6). Since <fa_1...a_n, a>∈E_H, there is a path from fa_1...a_n to Fw_1...w_p labeled ξ. We also have (fa_1ξ...a_nξ, Gv_1ρ...v_qρ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis Gv_1...v_q is a node of H and there is a path from Gv_1...v_q to Fw_1...w_p labeled ρ.

Case 2c: Fu_1τ...u_pτ is rewritten as aξ, then as biξ using an equation ax=bix in E_Σ and then as Gv_1ρ...v_qρ. Clearly (Fu_1τ...u_pτ, aξ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis there is a path from a to Fw_1...w_p labeled ξ (see Figure 3-7). Since there is a directed arc (b, a) labeled i, there is a path from b to Fw_1...w_p labeled iξ. We also have (biξ, Gv_1ρ...v_qρ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis Gv_1...v_q is a node of H and there is a path from Gv_1...v_q to Fw_1...w_p labeled ρ.

Case 2d: Fu_1τ...u_pτ is rewritten as biξ, then as aξ using an equation ax=bix in E_Σ and then as Gv_1ρ...v_qρ. Clearly (Fu_1τ...u_pτ, biξ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis there is a path from b to Fw_1...w_p labeled iξ (see Figure 3-8). Now there is a node c on this path such that the subpath from b to c is labeled i. Since there is a directed arc (b, a) labeled i, by Rules E_1, E_2, T we have <a, c>∈E_H. Thus there is a path from a to Fw_1...w_p labeled ξ. We also have (aξ, Gv_1ρ...v_qρ) ≺ (Fu_1τ...u_pτ, Gv_1ρ...v_qρ), so by the induction hypothesis Gv_1...v_q is a node of H and there is a path from Gv_1...v_q to Fw_1...w_p labeled ρ.

This concludes the Proof of the Claim, so we are done.


We remark here that Theorem 3.1 can be strengthened using the axiomatization of [54] for FD’s and IND’s (see Subsection 2.1.1). Specifically, we can show that we need not use Rule I_3 in the construction of H. To see this, consider the following sets of dependencies:

F_H = {u_1...u_p→u | u_k, k=1,...,p and u are nodes of H such that <Fu_1...u_p, u>∈E_H}.

I_H = {u_1...u_q⊆v_1...v_q | u_k,v_k are nodes of H such that there is a path from v_k to u_k labeled τ, k=1,...,q, where τ is a term over the i’s}.

Here we assume that Rule I_3 was not used in the construction of H. Clearly Σ⊆F_H∪I_H. Moreover, it is straightforward (but lengthy) to verify that F_H∪I_H is closed under the rules of [54] (using the fact that H is closed under Rules T, E_{1-2}, I_{1-2}). Therefore, Σ ⊨ A_1...A_n→A iff a_1...a_n→a is in F_H, and Σ ⊨ B_1...B_m⊆A_1...A_m iff b_1...b_m⊆a_1...a_m is in I_H. This stronger version, however, is not necessary for our purposes.


3.2 Typed IND’s and Acyclic FD’s 


Suppose we are given a set Σ of FD’s and typed IND’s, over database scheme D={R_j[U_j]:
the graph notation of Section 2.1.1). The FD’s and IND’s in Σ are represented in H_Σ as explained at
the beginning of this Section. We use a different label i^{jk} for each typed IND
R_k:A_1...A_q⊆R_j:A_1...A_q in Σ.


The fact that Σ contains only typed IND’s induces a special structure on the graph H (of Theorem 3.1), which we will now analyze. Consider the graph F_Σ of Section 2.1.1. This graph has a node a for each attribute A in U, and a group of red arcs (a_1,a),...,(a_n,a) labeled f for each group of red arcs (a_1^k,a^k),...,(a_n^k,a^k) labeled f of H_Σ. We define two partial functions type, node on the set of terms (over the a’s and the f’s). If τ is a term, type(τ) is the name of a relation scheme in D and node(τ) is a node of F_Σ. The functions type, node are defined inductively as follows:

1. For each attribute A of R_k, type(a^k) = R_k, node(a^k) = a.

2. If type(τ_j)=R_k and node(τ_j)=v_j for j=1,...,n, where there is a group of red arcs (v_1,v),...,(v_n,v) labeled f in F_Σ, then type(fτ_1...τ_n)=R_k, node(fτ_1...τ_n)=v.


The crucial property of H (in the case of typed IND’s) is given in the following

Lemma 3.1: The functions type, node are defined on all terms that appear as labels of nodes of H. Moreover,

1. If fτ_1...τ_n is a node of H then for j=1,...,n we have type(τ_j)=R_k and node(τ_j)=v_j, where there is a group of red arcs (v_1,v),...,(v_n,v) labeled f in F_Σ.

2. If <u,v> is an undirected arc of H then type(u) = type(v) and node(u) = node(v).

3. If (u,v) is a directed arc of H labeled i^{jk} then type(u)=R_j, type(v)=R_k and node(u) = node(v).

Proof: Straightforward simultaneous induction on the number of applications of rules that produced a node (arc) of H. ∎

Assume now that F_Σ is acyclic. It is not hard to see that in this case each node of F_Σ can be the image (under node) of at most an exponential number of terms (in the size of F_Σ). Therefore by Lemma 3.1 the size of H is at most exponential, and by Theorem 3.1 we obtain

Corollary 3.1: The implication problem for acyclic FD’s and typed IND’s is decidable. ∎

In particular, implication of an FD can be tested in exponential time, and implication of an IND can be tested in nondeterministic exponential time (by guessing appropriate paths of H). Whether these bounds can be improved is an open question.


We remark here that if Σ is a set of FD’s and typed IND’s over database scheme D and Σ⊨σ, where σ is an IND, then σ must be typed. This follows easily from Theorem 3.1 and Lemma 3.1, but can also be seen directly as follows: Consider a database d which associates to each relation scheme R_k of D a single tuple t_k, where t_k[A_j]=j, A_j∈U_k. Clearly d satisfies all FD’s and all typed IND’s (over D), but violates any IND which is not typed.
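
The single-tuple database of this remark can be checked mechanically. The following is a minimal Python sketch, assuming attributes are encoded by their integer indices j and an IND is given by two attribute sequences; the function names and the two sample relation schemes are illustrative choices, not part of the text.

```python
def single_tuple_db(schemes):
    """Map each relation name to a one-tuple relation with t[A_j] = j."""
    return {name: [{j: j for j in attrs}] for name, attrs in schemes.items()}


def satisfies_ind(db, sub_rel, sub_attrs, sup_rel, sup_attrs):
    """Check the IND sub_rel:sub_attrs ⊆ sup_rel:sup_attrs on database db."""
    left = {tuple(t[a] for a in sub_attrs) for t in db[sub_rel]}
    right = {tuple(t[a] for a in sup_attrs) for t in db[sup_rel]}
    return left <= right


# Hypothetical scheme: R1[A_1 A_2 A_3], R2[A_1 A_2 A_4].
db = single_tuple_db({"R1": [1, 2, 3], "R2": [1, 2, 4]})
typed_ok = satisfies_ind(db, "R1", [1, 2], "R2", [1, 2])  # same attributes on both sides
untyped_ok = satisfies_ind(db, "R1", [1], "R2", [2])      # different attributes
```

Every FD holds trivially because each relation has one tuple; the typed IND compares identical value tuples, while the untyped one compares the distinct values 1 and 2.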


3.3 Inference of FD’s under Pairwise Consistency 


Let Σ be a set of FD’s over database scheme D and let PC(D) be the set of all typed IND’s over D (recall that PC(D) expresses the fact that the database is pairwise consistent). By the remark at the end of the previous Section, PC(D)∪Σ does not imply any new IND’s, so we need only be concerned with implication of FD’s. Furthermore, observe that if a database d over D satisfies PC(D), then R_k:A_1...A_n→A holds in relation R_k iff R_j:A_1...A_n→A holds in relation R_j, where R_k[U_k], R_j[U_j] both contain attributes A_1,...,A_n,A. For this reason we can suppress relation names from FD’s.


In the presence of only typed IND’s, every term that appears as label of a node of the graph H (of Theorem 3.1) is of the form Fa_1^k...a_p^k, where type(Fa_1^k...a_p^k)=R_k: this is an easy consequence of Lemma 3.1. Now suppose we have pairwise consistency, there is a node labeled Fa_1^k...a_p^k, and A_m appears in relation scheme R_j, m=1,...,p; then there is a directed arc labeled i^{kj} from a_m^k to a_m^j. Thus, by the introduction rules (and an easy induction on the structure of F) there is a node labeled Fa_1^j...a_p^j. This observation allows us to represent the graph H more succinctly, by having only one node a_m for each attribute A_m and a node Fa_1...a_p for each term Fa_1^k...a_p^k that appears as a label of a node of H.

This representation can be further simplified if the FD’s in Σ are all unary. In this case all we need to observe is that the terms that appear as labels of nodes correspond to paths in the graph F_Σ (recall that F_Σ is a directed graph with a node a_k for each attribute A_k, and an arc (a_k,a_j) for each FD A_k→A_j in Σ). Moreover, it is not difficult to see that all such paths will appear as labels of nodes. We now give the formal details of this representation.


Let V be the set of nodes of F_Σ. For each attribute A_m, let T_{A_m} be the following (possibly infinite) directed tree:

the set of nodes P_{A_m}⊆V* is the set of all paths in F_Σ which start at a_m (denoted as sequences of nodes);
the set of arcs is {(sa_k, sa_ka_j) | s∈V*, sa_k∈P_{A_m}, A_k→A_j∈Σ}.

Let P=∪_{A_m∈U} P_{A_m}. Define E to be the smallest set of undirected arcs on P which contains <s,s> for all s∈P and <a_ka_j, a_j> for all A_k→A_j in Σ, and is closed under the following rules:

1. Propagation: If <sa_k, s'a_k>∈E, then <sa_ka_j, s'a_ka_j>∈E for all A_k→A_j in Σ.

2. Pseudo-Transitivity: If <s_1,s_2>, <s_2,s_3> are in E, s_k∈P_{A_k}, and there is a relation scheme in D which contains A_1,A_2,A_3, then <s_1,s_3> is in E.

By the preceding remarks and Theorem 3.1, we have

Lemma 3.2: PC(D)∪Σ ⊨ A_k→A_j iff <s,a_j>∈E for some s∈P_{A_k}. ∎


Example 3.1: Figure 3-9 has an example where D= {Rg[A,Q)Q>B], Ry[AA]Qy], Ra[A;Q\A2Q5], 
R,[A,Q>B]} and & is {AQ}, A} Ad, Ap B, Q) A, Qo B}. In this case, PC(D)U ZEA—B. 
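
When F_Σ is acyclic the path sets are finite, so the set E and the test of Lemma 3.2 can be computed by a direct fixpoint iteration. The following Python sketch is one possible reading of the Propagation and Pseudo-Transitivity rules, under the assumption that Pseudo-Transitivity looks at the starting attributes of the three paths involved. The function names and the small schemes used to exercise it are hypothetical, and the code does not terminate on a cyclic set of FD’s.

```python
def paths_from(start, fds):
    """All paths of F_Sigma starting at `start`, as tuples (F_Sigma must be acyclic)."""
    found, stack = [], [(start,)]
    while stack:
        p = stack.pop()
        found.append(p)
        stack.extend(p + (j,) for (k, j) in fds if k == p[-1])
    return found


def closure_E(schemes, fds):
    """Smallest symmetric arc set closed under Propagation and Pseudo-Transitivity."""
    attrs = sorted({a for r in schemes for a in r})
    P = {a: paths_from(a, fds) for a in attrs}
    E = {(s, s) for ps in P.values() for s in ps}
    for (k, j) in fds:                      # initial arcs <a_k a_j, a_j>
        E |= {((k, j), (j,)), ((j,), (k, j))}
    while True:
        new = set()
        for (s, t) in E:                    # Propagation
            if s[-1] == t[-1]:
                new |= {(s + (j,), t + (j,)) for (k, j) in fds if k == s[-1]}
        for (s1, s2) in E:                  # Pseudo-Transitivity
            for (t2, s3) in E:
                starts = {s1[0], s2[0], s3[0]}
                if s2 == t2 and any(starts <= set(r) for r in schemes):
                    new.add((s1, s3))
        if new <= E:
            return E, P
        E |= new


def implied_under_pc(schemes, fds, a_k, a_j):
    """Test of Lemma 3.2: is <s, a_j> in E for some path s starting at a_k?"""
    E, P = closure_E(schemes, fds)
    return any((s, (a_j,)) in E for s in P[a_k])
```

For instance, with the unary FD’s A→B and B→C, `implied_under_pc([("A","B","C")], [("A","B"),("B","C")], "A", "C")` holds, whereas over the schemes R1[AB], R2[BC], R3[AC] the same call fails because no single scheme contains A, B and C.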


The "only if" direction of Lemma 3.2 can also be proved by a counterexample construction. Suppose <s,a_j> is not in E, for any s in P_{A_k}. We will construct a pairwise consistent database d over D which satisfies the FD’s in Σ but violates A_k→A_j.

For each attribute A_m in U the domain of A_m, D_{A_m}, consists of all functions f: P_{A_m}→{0,1} such that, if <s,s'>∈E, s,s'∈P_{A_m}, then f(s)=f(s').

Let U_k be A_1...A_p. We construct a relation r_k over R_k[U_k] as follows: A tuple f_1...f_p (f_κ∈D_{A_κ}) is in r_k iff, for any s in P_{A_κ}, s' in P_{A_λ} (1≤κ,λ≤p) with <s,s'>∈E, we have f_κ(s)=f_λ(s').

It is easy to see that the database d consisting of the relations r_k satisfies the FD’s in Σ (by the definition of the set E). We also claim that d is pairwise consistent. The key observation is that, if A_{κ_1}...A_{κ_q} is any subset of U_k, then the projection of r_k on A_{κ_1}...A_{κ_q} consists of exactly those tuples f_{κ_1}...f_{κ_q} for which f_B(s)=f_C(s') whenever <s,s'>∈E (B,C in A_{κ_1}...A_{κ_q}). Finally, one can verify that if <s,a_j> is not in E, for any s in P_{A_k}, then d violates A_k→A_j.

The above construction produces in general an uncountable counterexample. Observe, however, that if Σ is acyclic then each P_{A_m} is finite, so the counterexample is finite. It follows that for acyclic unary FD’s under pairwise consistency, finite implication coincides with (unrestricted) implication:

Theorem 3.2: The class of acyclic unary FD’s under pairwise consistency is finitely controllable. ∎


We now make some simple remarks about the set of undirected arcs E. Observe that, if <s_1, s_2>∈E and s_1s', s_2s' are in P, then <s_1s', s_2s'>∈E. This is an easy consequence of Propagation. Also, if <as_1, as_2>∈E and sas_1, sas_2 are in P, then <sas_1, sas_2>∈E. To see this, suppose s is s'b, where b is a node such that B→A is in Σ. Then <ba, a>∈E, so by Propagation <bas_1, as_1>∈E. Similarly <bas_2, as_2>∈E. Then by Pseudo-Transitivity <bas_1, bas_2>∈E. We are now ready to prove the main result of this Section.


Theorem 3.3: The implication problem for unary FD’s in the presence of pairwise consistency is undecidable.

Proof: We reduce the uniform word problem for semigroups (Thue systems [50]) to implication of u-FD’s under pairwise consistency. We assume that we are given a set S of word equations of the form α_iα_j = α_k; the problem is to determine whether S ⊨ α_1α_2=α_3. Recall that this happens iff the string α_3 can be obtained from the string α_1α_2 by successively replacing a substring w_1 by a substring w_2, where w_1 = w_2 (w_2 = w_1) is an equation in S.
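
The rewriting characterization recalled above can be illustrated by a bounded search. The sketch below, in Python, explores strings reachable by two-way substring replacement up to a length bound; the function name, the string encoding of words, and the bounds `max_len` and `max_steps` are assumptions made for illustration. Since the word problem is undecidable, a negative answer only means that no derivation was found within the bounds.

```python
from collections import deque


def derivable(equations, start, target, max_len, max_steps=10000):
    """Breadth-first search for a derivation of `target` from `start`.

    `equations` is a list of pairs (w1, w2) of nonempty strings, each usable in both
    directions. Only strings of length <= max_len are explored.
    """
    rules = list(equations) + [(r, l) for (l, r) in equations]
    seen, queue, steps = {start}, deque([start]), 0
    while queue and steps < max_steps:
        w = queue.popleft()
        steps += 1
        if w == target:
            return True
        for l, r in rules:
            i = w.find(l)
            while i != -1:                       # try every occurrence of l in w
                nw = w[:i] + r + w[i + len(l):]
                if len(nw) <= max_len and nw not in seen:
                    seen.add(nw)
                    queue.append(nw)
                i = w.find(l, i + 1)
    return False
```

With the hypothetical equations ab = c and ca = b, the call `derivable([("ab", "c"), ("ca", "b")], "aba", "b", max_len=4)` finds the derivation aba, ca, b.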


For cach given equation in S, say a,j = ay, We include in our database scheme relation 
schemes Rj.7, Kj-, Ry-3, L, Mj-2, as shown in Figure 3-10. The directed arcs represent unary FD's. 
There are two gencral-purpose attributes X,Y. For cach a,, there are two attributes A,,,B,,, and for 
cach equation there is a sct of attributes Q)-.. 
If the equation fo be inferred is ajay = a3, then we include in the database scheme relation 
schemes R}.7, Ky-9, Rj-3, L, Jy-3 and FD’s as in Figure 3-10 (where now A,B; are Aj, By, Aj.B; are 
A>,B>, Ay,B, are A3,B3, and we have used attributes Qj-g). We will show that the u--D Q¢-+Q is 


implied iff S ⊨ α_1α_2=α_3. Let P be a set of nodes and E a set of undirected arcs as in Lemma 3.2.
Claim: The undirected arc <xa_1b_1yxa_2b_2y, xa_3b_3y> is in E iff S ⊨ α_1α_2 = α_3.


Proof of Claim: We will give a characterization of the set E. Let ¢ be an equation Oj; = Oy in S, 
and suppose ¢ gives rise to relation schemes Ry_7, Ky.>, Ry.3, L, Mj,-2, as in Figure 3-10. Consider the 


following sets of undirected arcs which correspond to ¢ (all these arcs are in E): 


Ej: 
<xaj, a), 
<a;b;, bp. <qyb,, bp, <ajb;,, q)b;>, 
<biy, y>, <qay, y>, <Diy, Gay, 
<yX, XD, 3X, X, <YX, G3X>, 
<xaj, a>, <q4aj, a>, <xa;, 444), 
<diy, Y>, <Q6Y, Y>, <DiY, Gey, 
<xay, ay, 
<a, by, b,>, <q7b,, b,>, <a,b,, q7b,>, 
ES: 
<xa;b;, q,b;>, 
<ajbiy, qay>, <q ,biy, qny>, 
<DiYX, G3X>, <qzYX, 43x, 
<yxaj, q4aj>, <q3Xaj, 44a}, 
<ajbiy, dey, <qsbyy, 6y>> 
<xayb,, q7b,>, 
<a, byy, dgy>, <q7b,y, day. 
E3: 
<q byyx, q3X?, <qoyxa;, q4aj>. <q3xajbj, q5b;>, <qgajbiy, dey» 


<xa,byy, Ggy>. 


Ey 
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<q [biyxaj, 948), <qoyxajbj, q5b)>, <q3xajbiy, deY?- 


ES: 
<q bjyxa;b;, 95b}>. <qoyxajbiy, Gey: 


3¢, 
ES: 


<xajbjyxajbyy, dey. 


E;: 
<Q6Y, Agy>. <xXayDyy, Ay, <xajbiyxajbiy, dgy>, 
<xajbiyxajbyy, xa, byy>. 


It is not difficult to see that for each equation e in S, k=1,...,7, E_k^e is contained in E (compare with Figure 3-9).

Now consider the following set of arcs E': Let <s_1,s_2> be a member of some E_k^e (for some e,k), and suppose s' is obtained from s by successively replacing a subsequence xa_ib_iyxa_jb_jy by a subsequence xa_kb_ky (or vice versa), where α_iα_j=α_k is in S. If s_1s, s_2s' are in P, then put <s_1s, s_2s'> in E'. Also if s, s' are in P, then put <s, s'> in E'.

By the remarks immediately preceding the statement of Theorem 3.3 (and the fact that E_k^e⊆E) we have E'⊆E. Furthermore E' contains the arcs initially put in E, and clearly it is closed under Propagation. It is also straightforward (albeit a bit tedious) to verify that E' is closed under Pseudo-Transitivity. Therefore E⊆E', and thus E=E'. The Claim now follows from this characterization of E.


To finish the Proof, observe that Q¢—+Q is implied (Lemma 3.2) iff<xa,b)yxa by, xa3zb3y> is in E 
(cf. Figure 3-10). il 


We will now show that there is no k-ary axiomatization for implication of u-FD’s in the presence of pairwise consistency.

Let D be a database scheme and Θ a set of sentences about D (for instance, FD’s and IND’s). An axiom system for implication of sentences in Θ is k-ary [16] iff it is universe-bounded (i.e. only attributes in D are mentioned) and every rule has at most k antecedents, for some fixed integer k. Observe that the axiom system of [54] for implication of FD’s and IND’s is not k-ary, because Rule 10 violates the boundedness condition (see Subsection 2.1.1).

Let Σ⊆Θ, σ in Θ. We say that Σ is closed under implication iff whenever Σ⊨σ we have σ∈Σ. Also, Σ is closed under k-ary implication iff whenever Σ'⊨σ, where Σ'⊆Σ, |Σ'|≤k, we have σ∈Σ. The following characterization for the existence of k-ary axiomatizations is taken from [16]:

Proposition 3.1: There is a k-ary axiomatization for implication of sentences in Θ iff whenever Σ⊆Θ is closed under k-ary implication, Σ is closed under implication. ∎

Theorem 3.4: There is no k-ary axiomatization for implication of u-FD’s under pairwise consistency (we consider here axiomatizations involving arbitrary FD’s and IND’s).


Proof: Let U be {A,A_1,...,A_k,Q_1,...,Q_k,B} and let D be a database scheme over U consisting of
relation schemes R_0[AQ_1...Q_kB], R_1[AA_1Q_1], R_j[A_{j-1}Q_{j-1}A_jQ_j], j=2,...,k, R_{k+1}[A_kQ_kB]. Let Φ be the
following set of FD's over D: Ry:A— Aq, Ry: Ajj Aj j= 2sk, RiQ) 1 OAR I= 2.-5k, 

Rye LAR OB, Ry 4 'Qh > B, Ro:Qj-B, j= 1.....k (cf Figure 3-9 for the case k= 2). 


Consider the set Φ' of FD’s which are consequences of Φ. The set Φ' can be constructed by closing Φ under Rules 1,2,3 of the axiom system of [54] (see Subsection 2.1.1). Let Σ be Φ'∪PC(D). We will show that Σ is not closed under implication (of FD’s and IND’s), but is closed under k-ary implication (of FD’s and IND’s). Theorem 3.4 will then follow by Proposition 3.1.

For the first part, it is not hard to see that Σ⊨σ, where σ is R_0:A→B (cf. Figure 3-9). Since σ is not in Σ, we are done.

For the second part, suppose Σ'⊨σ, where Σ'⊆Σ, |Σ'|≤k, σ is an IND or an FD. We will show that σ is in Σ.

If σ is an IND, then σ must be typed, by the remark at the end of Section 3.2. Thus σ is in Σ.

Suppose now σ is an FD R_p:C_1...C_q→C_0, where 0≤p≤k+1 and all the C_j’s are in U_p. Since all the FD’s in Φ are unary, it easily follows from Theorem 3.1 that Σ'⊨R_p:C_m→C_0, for some m, 1≤m≤q. We will argue that R_p:C_m→C_0 is in Φ'; from this it easily follows that σ is in Φ', i.e. it is in Σ.


Consider the nodes c_m, c_0 of the graph F_Σ (cf. Figure 3-9). If there is no directed path from c_m to c_0, then we can construct a relation r over U which satisfies all the FD’s in Φ (without their relation names) but violates C_m→C_0. We can then project r over the R_j’s to obtain a database d over D which satisfies Σ (and thus also Σ') and violates R_p:C_m→C_0.

Thus, there is a directed path from c_m to c_0. Since C_m, C_0 also appear in the same relation scheme, it is easy to check that R_p:C_m→C_0 is in Φ', unless R_p:C_m→C_0 is R_0:A→B. However, since |Σ'|≤k one of the FD’s R_1:A→A_1, R_j:A_{j-1}→A_j, j=2,...,k, R_{k+1}:A_k→B must be missing from Σ' and therefore we
saaees Save 7 FHA 28 ieee rt © ne eee pa ee Ew Fz) oe emt Oe 7 


proof. ∎


Figure 3-2: Graph rules for FD’s and IND’s

Figure 3-3: Basis case

Figure 3-4: Case 1

Figure 3-5: Case 2a

Figure 3-7: Case 2c

Figure 3-8: Case 2d

Figure 3-9: Example of FD inference under pairwise consistency

Figure 3-10: Gadgets for Proof of Theorem 3.3


Chapter Four 


Finite Implication of FD’s and Unary IND’s 


A natural question is whether our equational approach can handle finite implication of database constraints. Ideally, we would like to be able to replace ⊨ by ⊨_fin throughout Theorem 2.1. It is easily seen that the same arguments can show that (iii)⇒(ii) and (ii)⇒(i) in the finite case (the constructions given map finite counterexamples to finite counterexamples). The argument for (i)⇒(iii), however, breaks down, because it is based on the existence of a complete proof procedure for implication (namely the chase) and such a proof procedure cannot exist for finite implication [54, 19]. As a matter of fact, the same syntactic nature of the proofs of Theorems 2.3 and 3.3 prevents us from proving undecidability of finite implication. The weaker proofs of [54, 19], because of their semantic nature, can easily be done for the finite case.

However, Theorem 2.4 also holds for the finite case: By the discussion above one can see that ⊨ can be replaced by ⊨_fin in Theorem 2.1 if we have a finitely controllable class of FD’s and IND’s, i.e. a class where ⊨_fin is the same as ⊨. Acyclic IND’s and FD’s provide an easy example of such a class, because the chase in this case constructs a finite counterexample if the implication does not hold. Another example of a finitely controllable class is acyclic unary FD’s under pairwise consistency (Theorem 3.2).

If ⊨_fin is different from ⊨, we might still be able to handle the finite case if there is a complete proof procedure for finite implication. In this Chapter we provide such a class: we show that there is a complete proof procedure for finite implication of FD’s and unary IND’s. This proof procedure is then used to prove a (weaker) analogue of Theorem 2.1 for finite implication of FD’s and u-ID’s.


Let Σ be a set of FD’s and u-ID’s over a database scheme D containing a single relation scheme R[U]. If σ is an FD or u-ID, we will show that Σ ⊨_fin σ iff σ can be proved from Σ using the following set of rules (*). We use X,Y to denote sets of attributes. We denote a u-ID A⊆B alternatively as B⊇A.


Rules (*):

1. (reflexivity) A→A, A∈U.

2. (augmentation) from X→A derive XY→A, A∈U.

3. (transitivity) from X→A_k, k=1,...,n, A_1...A_n→A, derive X→A, A∈U.

4. (u-ID reflexivity) A⊆A, A∈U.

5. (u-ID transitivity) from A⊆B and B⊆C derive A⊆C, A,B,C∈U.

6. (cycle rules) For every odd positive integer m and attributes A_0,...,A_m,
from A_0→A_1 and A_1⊇A_2 and...and A_{m-1}→A_m and A_m⊇A_0
derive A_1→A_0 and A_2⊇A_1 and...and A_m→A_{m-1} and A_0⊇A_m.


Rules 1,2,3 are the standard rules for FD’s [5] (written in our notation) and Rules 4,5 are the specialization of the general IND rules of [16] to u-ID’s. Thus, Rules 1-5 are sound for general databases (infinite as well as finite). A simple counterexample construction shows that Rules 1-5 are also complete for unrestricted implication of FD’s and u-ID’s. More specifically, FD’s and u-ID’s decouple in the case of unrestricted implication.

Proposition 4.1: Let Σ_F be a set of FD’s and Σ_I a set of u-ID’s.
1. Σ_F∪Σ_I ⊨ X→A iff Σ_F ⊨ X→A.
2. Σ_F∪Σ_I ⊨ A⊆B iff Σ_I ⊨ A⊆B.


Proof: The "if" direction is obvious in both cases. We will show the "only if" direction.

1. Suppose Σ_F does not imply X→A. Let X* = {B | B∈U, Σ_F ⊨ X→B}. Consider a relation r consisting of tuples t_k, k=0,1,2,..., where t_0[B]=0, B∈U, and for k=1,2,..., t_k[B]=k−1 if B∈X* and t_k[B]=k otherwise. It is easy to see that r satisfies the FD’s in Σ_F (the only tuples to check are t_0,t_1), and obviously r satisfies all u-ID’s. Now since A is not in X*, r violates X→A. Therefore, Σ_F∪Σ_I does not imply X→A.

2. Suppose Σ_I does not imply A⊆B. Let G_I be a directed graph which has a node a_m for each attribute A_m in U and a directed arc (a_j,a_k) for each u-ID A_k⊆A_j in Σ_I. By our assumption, there is no directed path from b to a in G_I (cf. Rules 4,5). Thus, we can assign to each node u of G_I a number c(u) so that c(u)≤c(v) whenever there is a directed path from u to v, and c(b)>c(a) (this can be done by a topological sort of the dag of strongly connected components of G_I [2]). Now consider a relation r consisting of tuples t_k, k=0,1,2,..., where for A_m in U we have t_k[A_m]=k+c(a_m). Clearly r satisfies all u-ID’s in Σ_I and violates A⊆B. Moreover, r satisfies all FD’s, so Σ_F∪Σ_I does not imply A⊆B.
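
Because FD’s and u-ID’s decouple for unrestricted implication, the two tests can be run independently. The Python sketch below shows this under stated assumptions: an FD is encoded as a pair (left-hand attribute tuple, right-hand attribute), a u-ID A⊆B as the pair (A, B), and the function names are illustrative. The FD test computes the closure set X* used in the proof by exhaustively applying Rules 1-3; the u-ID test is reachability along Rules 4-5.

```python
def fd_closure(X, fds):
    """Return X* = {B | the FD's imply X -> B}, computed by a fixpoint iteration."""
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and rhs not in closure:
                closure.add(rhs)
                changed = True
    return closure


def uid_implied(a, b, uids):
    """Is a ⊆ b derivable from the u-ID's by reflexivity and transitivity?"""
    reached, frontier = {a}, [a]
    while frontier:
        u = frontier.pop()
        for sub, sup in uids:
            if sub == u and sup not in reached:
                reached.add(sup)
                frontier.append(sup)
    return b in reached
```

For example, with the hypothetical FD’s A→B and BC→D, the closure of {A, C} is {A, B, C, D}; with the u-ID’s A⊆B and B⊆C, `uid_implied("A", "C", ...)` holds while `uid_implied("C", "A", ...)` does not.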


As a matter of fact, the cycle rules are not sound for infinite databases: Consider a relation r over relation scheme R[AB], consisting of tuples t_k, k=0,1,2,..., where t_k[A]=k, t_k[B]=k+1: clearly r satisfies B→A, A⊇B, but violates B⊇A. On the other hand, a simple counting argument shows that the cycle rules are sound in the finite case. Let |r[A]| denote the cardinality of column A of relation r. If the antecedents of a cycle rule hold in r we have |r[A_0]|=|r[A_1]|=...=|r[A_m]|. Now if a finite relation r satisfies |r[A]|=|r[B]| and A→B, it easily follows that it satisfies B→A. Similarly, from |r[A]| = |r[B]| and A⊇B it follows for finite databases that B⊇A.
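
The finite soundness of the simplest cycle rule (m=1: from B→A and A⊇B derive A→B and B⊇A) can be confirmed by exhaustive search over small relations. The following Python sketch enumerates every nonempty relation over R[AB] with values drawn from a small domain; the function names and the choice of a three-element domain are assumptions for illustration, and the search is a sanity check rather than a proof.

```python
from itertools import combinations, product


def fd_holds(r, x, y):
    """Does the unary FD (column x) -> (column y) hold in relation r?"""
    image = {}
    return all(image.setdefault(t[x], t[y]) == t[y] for t in r)


def ind_holds(r, x, y):
    """Does the u-ID (column x) ⊆ (column y) hold in relation r?"""
    return {t[x] for t in r} <= {t[y] for t in r}


def cycle_rule_violations(domain_size=3):
    """Relations over R[AB] (A = column 0, B = column 1) breaking the m=1 cycle rule."""
    tuples = list(product(range(domain_size), repeat=2))
    bad = []
    for n in range(1, len(tuples) + 1):
        for r in combinations(tuples, n):
            antecedents = fd_holds(r, 1, 0) and ind_holds(r, 1, 0)   # B->A, B ⊆ A
            consequents = fd_holds(r, 0, 1) and ind_holds(r, 0, 1)   # A->B, A ⊆ B
            if antecedents and not consequents:
                bad.append(r)
    return bad


# A finite prefix of the infinite counterexample t_k[A]=k, t_k[B]=k+1.
prefix = [(k, k + 1) for k in range(5)]
```

The search returns no violating relation, in line with the counting argument. The finite prefix no longer satisfies A⊇B, since its largest B-value is missing from column A, which is why the counterexample needs infinitely many tuples.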


In order to analyze the rules (*), we use a graph notation for dependencies similar to the notation of Subsection 2.1.1. If $\Sigma$ is a set of FD's and u-ID's, $G_\Sigma$ is a graph which has a node $a_m$ for each attribute $A_m$, a red arc $(a_i,a_j)$ for each FD $A_i \to A_j$ in $\Sigma$, and a black arc $(a_i,a_j)$ for each u-ID $A_i \supseteq A_j$ in $\Sigma$. If between nodes u,v of $G_\Sigma$ we have red (black) arcs in both directions, we replace them with an undirected red (black) edge. The transitivity and cycle rules imply that, when $A_i \to A_j$ ($A_i \supseteq A_j$) corresponds to some arc in a directed cycle of $G_\Sigma$, we can infer $A_j \to A_i$ ($A_j \supseteq A_i$). In fact, if $\Sigma$ is closed under the rules (*) then $G_\Sigma$ has a good deal of structure, as can be easily verified.


Proposition 4.2: If $\Sigma$ is a set of FD's and u-ID's closed under the rules (*) then $G_\Sigma$ has the following properties:
1. Nodes have red (black) self-loops. The red (black) subgraph of $G_\Sigma$ is transitively closed.
2. The subgraphs induced by the strongly connected components of $G_\Sigma$ are undirected.
3. In each strongly connected component of $G_\Sigma$, the red (black) edges partition the set of nodes into a collection of node-disjoint cliques.
4. If $A_1 \dots A_n \to A$ is an FD in $\Sigma$ and $a_1,\dots,a_n$ have a common ancestor u in the red subgraph of $G_\Sigma$, then $G_\Sigma$ contains a red arc (u,a).


By a topological sort of the dag of strongly connected components of $G_\Sigma$ we can assign to each component a unique scc-number, smaller than the scc-number of all its descendant components in the dag [2]. Thus every node u in the graph $G_\Sigma$ of Proposition 4.2 belongs to a unique maximal red (black) clique and a unique strongly connected component. Let scc(u) denote the scc-number of the component of node u.
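As a minimal sketch of how scc-numbers might be computed, the Python function below uses Tarjan's strongly-connected-components algorithm, which emits components in reverse topological order of the component dag; reversing that order gives each component a number smaller than those of its descendants. The function name `scc_numbers` and the representation of the graph as a node list plus a list of arcs are assumptions made for this illustration, and arc colours are ignored.

```python
def scc_numbers(nodes, arcs):
    """Map each node to the scc-number of its strongly connected component.

    If there is an arc u -> v between two different components,
    the returned numbering satisfies scc[u] < scc[v].
    """
    succ = {u: [] for u in nodes}
    for u, v in arcs:
        succ[u].append(v)

    index, low, on_stack, stack, comps = {}, {}, set(), [], []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in succ[u]:
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:          # u is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == u:
                    break
            comps.append(comp)

    for u in nodes:
        if u not in index:
            visit(u)

    # Tarjan's algorithm emits sink components first; reverse the order
    # so that ancestors receive smaller numbers than descendants.
    return {u: k for k, comp in enumerate(reversed(comps)) for u in comp}

scc = scc_numbers("abcd", [("a", "b"), ("b", "a"), ("b", "c"),
                           ("c", "d"), ("d", "c")])
print(scc)
```

The recursive formulation is adequate for graphs with one node per attribute; a very deep graph would need an iterative variant.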
Figure 4-1 illustrates an example of such a graph $G_\Sigma$. There are four strongly connected components, each a black clique, with all black arcs present from components with smaller to components with larger scc-number. The red cliques and red arcs are shown explicitly.

We now give a construction which lies at the heart of our completeness proof.


Lemma 4.1: Let $\Sigma$ and $G_\Sigma$ be as in Proposition 4.2 (i.e., closed under the rules (*)). Let the dag of strongly connected components of $G_\Sigma$ be topologically sorted, so that each component has a unique scc-number. We can construct a finite relation r such that:

1. The u-FD $A \to B$ holds in r iff it is in $\Sigma$. Also all FD's in $\Sigma$ hold in r.

2. The only repeated symbol in each column of r is 0, and the symbols in r[A] are exactly the integers from 0 to $|r[A]|-1$. Moreover, $|r[A]| \ge |r[B]|$ iff $scc(a) \le scc(b)$ (thus, the u-ID $A \supseteq B$ holds in r iff $scc(a) \le scc(b)$, and all u-ID's in $\Sigma$ hold in r).


Proof: First put in r a tuple of all 0's. Process each strongly connected component of $G_\Sigma$ in turn, in order of increasing scc-number. Begin processing a component by processing in turn each of its red cliques. To process a red clique $\kappa$, add a tuple with all 0's in the columns of the attributes of $\kappa$ and of the attributes in all red cliques that are descendants of $\kappa$ in the red subgraph of $G_\Sigma$. For now leave all other positions blank.

For every red clique $\kappa$ keep a count of the number of 0's in one of its columns (by the way the construction proceeds all columns of $\kappa$ have the same number of 0's). Now that one tuple was added for each red clique in the component, in order to terminate processing the component repeat certain of the tuples just added, so as to make the counts of all cliques in the component equal, and strictly greater than the counts of the cliques of the previous component. This is possible because no red clique is a red descendant of another red clique in the same component, or in a component with larger scc-number. Once a component is processed, no further 0's are added in its columns and its counts no longer change.

After adding tuples for all red cliques in all strongly connected components, we examine in turn each column. If the column has s blank positions, we fill them in with the numbers 1 to s, without any repetitions. We illustrate the construction in Figure 4-1.


Now it is easy to check that conditions 1,2 hold:

1. No u-FD in $\Sigma$ was violated during the construction. Furthermore, all u-FD's not in $\Sigma$ were violated. To see this, observe that if $A \to B$ is not in $\Sigma$, then the tuple inserted for the red clique of A and the initial tuple of all 0's disprove $A \to B$.

We must also verify that all non-unary FD's in $\Sigma$ are satisfied. Suppose $A_1 \dots A_n \to A$ is an FD in $\Sigma$ violated by r. Since the only repeated symbol in each column is 0, there is a tuple t of r such that $t[A_k]=0$, k=1,...,n, $t[A] \ne 0$. Now t was inserted in r while processing a red clique $\kappa$, so all 0's in t correspond to attributes that are functionally determined by every attribute B of $\kappa$. Since $\Sigma$ is closed under Rules 1,2,3, it follows that $B \to A_k$ is in $\Sigma$, k=1,...,n, and also $B \to A$ is in $\Sigma$. But then r satisfies $B \to A$, and since $t[B]=0$ and there is an initial tuple of all 0's, we obtain $t[A]=0$, which is a contradiction.

2. By the way r is constructed, the final counts are strictly increasing with the scc-numbers, and are equal in all columns of a strongly connected component. ∎


We will now prove our main result: 
Theorem 4.1: The rules (*) are sound and complete for finite implication of FD's and u-ID's.

Proof: We have already argued for soundness, so it remains to show completeness. Let $\Sigma$ be a set of FD's and u-ID's closed under the rules (*), and let $\sigma$ be an FD or u-ID not in $\Sigma$. We will exhibit a finite counterexample relation r which satisfies $\Sigma$ but violates $\sigma$.


Case 1 ($\sigma$ is an FD):

If $\sigma$ is unary, then the relation constructed in Lemma 4.1 is the desired counterexample. If $\sigma$ is not unary, we can use a construction similar to that of Lemma 4.1. In this case the counterexample relation is the union of two relations $r_0, r_1$.

Let $\sigma$ be $X \to A$. The first relation $r_0$ is a two-tuple relation with one tuple all x's and the other having x's only in the attributes that are functionally determined by X in the set $\Sigma$. The remaining positions of this second tuple are initially left blank.

The second relation $r_1$ contains the symbols 0,1,... (but not x) and is constructed so that the union of $r_0$ and $r_1$ has the right number of repetitions of the symbol 0 in $r_1$ to satisfy all u-ID's in $\Sigma$. The construction of $r_1$ parallels the Proof of Lemma 4.1. The only difference is that now the counts are the number of 0's and x's in the union of the two relations. When the correct number of blanks have been inserted in all columns, i.e. all columns in a strongly connected component have the same count and the count increases with scc-number, then the blanks can be filled in as in the Proof of Lemma 4.1 and all u-ID's in $\Sigma$ are satisfied.

Case 2 ($\sigma$ is a u-ID):

Let $\sigma$ be $C \supseteq D$. Repeat the construction in the Proof of Lemma 4.1, with the following modification: if the column for attribute A has s blank positions, fill in the blanks with the numbers 1 to s if there is no black arc (a,d) in $G_\Sigma$; otherwise, fill in the blanks with 1,...,s-1, x. The relation thus constructed satisfies the FD's in $\Sigma$, by the same argument as in the Proof of Lemma 4.1. To see that the u-ID's in $\Sigma$ are also satisfied, observe that $A \supseteq B$ is violated iff either
(i) $scc(a) > scc(b)$, or
(ii) $scc(a) \le scc(b)$, there is no black arc (a,d), and there is a black arc (b,d).
By the properties of $G_\Sigma$, this means there is no black arc (a,b), i.e. $A \supseteq B$ is not in $\Sigma$. Finally, it is clear that $C \supseteq D$ is violated.

See Figure 4-2 for an example of this construction. ∎


We remark that Theorem 4.1 leads easily to a polynomial-time algorithm for finite implication of 
FD’s and u-ID’s [44]. We will now use Theorem 4.1 to prove an analogue of Theorem 2.1, this time 


for finite implication of FD’s and u-ID’s. The notation is taken from Chapter 2. 


Theorem 4.2: In each of the following two cases, (i),(ii),(iii) are equivalent:
FD Case: 
i) Dg, Ay-Ap A. 
ii) Fg in VeEgT* Mp t[x1/a}X,....X,/4,X] = ax. 
iii) 855, VET Mp T[X,/0y,...,X,/a,]= a. 
u-ID Case: 
i) D5, BCA. 
ii) EsF gy V-ET* (M)) ar = bx. 
iii) Ss$F 5, VET" (M,) t[x/a]=B. 
Proof: The implications (iii) $\Rightarrow$ (ii), (ii) $\Rightarrow$ (i) can be proved by the same argument as in the Proof of Theorem 2.1. The reason is that the constructions we give map finite counterexamples to finite counterexamples.

(i) $\Rightarrow$ (iii): Suppose $\Sigma \models_{fin} \sigma$, where $\sigma$ is an FD or u-ID. By Theorem 4.1, there is a proof of $\sigma$ from $\Sigma$ using the rules (*). Let z be the number of steps of such a proof. We show both the FD and the u-ID Cases by simultaneous induction on z.

Basis: z=0. The conclusion is straightforward.

Induction Step: We distinguish six cases, depending on the last rule which was applied to prove $\sigma$.

Rules 1,2 Straightforward.


Rule 3 This means the FD’s A}...A,7B,, k=1,....m, B,...B,,—A can be proved from & (in less 
than z steps); Rule 3 is then applicd to derive Ay...A,—A. By the induction hypothesis, &y finitely 
implies VET" (Mp T[X}/a,X,....X,/a,X] = b,x, k= 1,....m, and also Sy finitely implies 
VET" (Mp T[X ,/D)X,....X_,/D,,x] = ax. Thus, 85. finitely implies 
Vv ame Td (Mp T[X | 7 [X17 ay XyesXpZAQX], oo Xp / TX Xo-% p/p X]] = ax, Le, 


Sy in VT (My TX 1/apXonnXp/ApX] = aX. 
Rule 4 Straightforward. 


Rule 5 Similar to Rule 3. 


Rule 6 Now the dependencies Ag Ay, AyDA2,.., Am-j—?Am: Am2Ao (m odd) can be proved 


from & (in less than z steps); then by a cycle rule we derive Ay—+Ap. 


Let A be a finite model of &y. By the induction hypothesis A satisfies pgay = ay, 7,0] = O,..., 
Pm-1%m-1 =%m> Tm%m = &» Where Pye (Mp), TET" (M;) (we write ra as a shorthand for 7[x/a]). 


We will show that there is some p’ in J* (Mp) such that A satisfies p’a) = ap. 


Observe first that A satisfies pot Ppp -1---73P27T 1] = a (concatenation denotes composition). By 
the commutativity conditions (5) of 8, potmPm-1--73P271= PoPm-1-P2Tm-737], 80 A satisfies 
POP m-1P2T m--7T3T1 Ay = Ay. NOW put Poppy 7---P2 = Ps Tye T 37] = Ts Ty T 37 Oy =A, 

We now have ra,;=a@, pa=ay,. We will argue from these two equations that there exists some p’ in 
J*(M,) such that A satisfies p’e,=a. It will then follow, since p,,7..0)@ =p, that A satisfies 
Pm-]--P2P @] = A. 

Consider the set $K=\{\rho^k a_1 : k \ge 0\}$ ($\rho^k$ is $\rho$ composed with itself k times; recall that $\tau a_1 = a$ and $\rho a = a_1$). Since A is finite, K is finite, and therefore there exists a least integer q such that $\rho^q a_1=\rho^s a_1$ for some s greater than q. We will first argue that q=0. Assume on the contrary that $q \ge 1$. By commutativity, $\tau\rho^q a_1=\rho^q\tau a_1=\rho^q a=\rho^{q-1}\rho a=\rho^{q-1}a_1$, and similarly $\tau\rho^s a_1=\rho^{s-1}a_1$. But this means $\rho^{q-1}a_1=\rho^{s-1}a_1$, which contradicts the choice of q.

Since q=0, $\rho^s a_1=a_1$; but now $a=\tau a_1=\tau\rho^s a_1=\rho^s\tau a_1=\rho^{s-1}\rho a=\rho^{s-1}a_1$, i.e. A satisfies $\rho^{s-1}a_1=a$. This concludes the proof.


If a cycle rule is applied to derive a u-ID, we argue in an entirely analogous way. ∎


Figure 4-1: Construction of a finite counterexample relation

Figure 4-2: Relation that violates a u-ID

Chapter Five


Partition Dependencies 


5.1 Preliminaries 


Let D be a database scheme containing a single relation scheme R[U], $U=\{A_1,\dots,A_u\}$. We can express database constraints as formulas of first-order predicate calculus with equality [32]. These formulas have a single relation symbol R of arity u which represents the relation R, and no function (or constant) symbols.


Specifically, let us call atomic formulas of the form $Rx_1 \dots x_u$ relational formulas and atomic formulas x=y equalities. A formula is typed iff there are disjoint classes (types) of variables such that
1. if $Rx_1 \dots x_u$ appears in the formula, then $x_k$ is of type k, k=1,...,u, and
2. if x=y appears in the formula, then x,y have the same type.


Definition 5.1: An embedded implicational dependency (EID [34]) is a typed sentence of the form
$\forall x_1 \dots x_p.\ (\varphi_1 \wedge \dots \wedge \varphi_n \Rightarrow \exists y_1 \dots y_q.\ (\psi_1 \wedge \dots \wedge \psi_m))$,
where each $\varphi_i$ is a relational formula, each $\psi_j$ is either a relational formula or an equality between two of the $x_i$'s, and each of the $x_i$'s appears in one of the $\varphi_i$'s.


Example 5.1:
(a) Let $U=\{A_1,A_2,A,B\}$. The FD $A_1A_2 \to A$ can be expressed as the EID
$\forall x_1 x_2 x y x' y'.\ [(Rx_1x_2xy \wedge Rx_1x_2x'y') \Rightarrow x=x']$.

(b) Let $U=\{A,B,C\}$. The MVD $A \twoheadrightarrow B$ [62, 51] is equivalent to the EID
$\forall z x y x' y'.\ [(Rzxy \wedge Rzx'y') \Rightarrow Rzxy']$.


Now let r be a relation over a finite universe of attributes U, and let $\sigma$ be an EID. As one can easily observe, to decide whether $r \models \sigma$ we do not need to know the particular values appearing in r, but only the equalities between these values. As a matter of fact, all that is relevant about two tuples t,s of r is the set of attributes on which they agree. We can capture this information formally by considering, for each attribute A in U, the partition $\pi_A$ which is induced on the set of tuples of r by the values of r in column A: two tuples t,s of r are in the same block of $\pi_A$ iff they agree on A. The set $\{\pi_A \mid A \in U\}$ characterizes the EID's satisfied by r.


Although the above observation does not seem to take us very far regarding general EID's, it does lead to an elegant algebraic formulation of FD's [15, 60, 27]. Recall that partitions have a natural partial order $\le$, and two natural binary operations $\cdot$, +: Given partitions $\pi$, $\pi'$ of a set S,

$\pi \le \pi'$ iff for every block x of $\pi$ there is a block x' of $\pi'$ such that $x \subseteq x'$,

$\pi \cdot \pi' = \{x \mid x = y \cap z \ne \emptyset,\ y \in \pi,\ z \in \pi'\}$,

$\pi + \pi' = \{x \mid a,b \in S$ are in x iff there is a sequence $x_0,\dots,x_n$ such that $x_i \in \pi \cup \pi'$ for i=0,...,n, $a \in x_0$, $b \in x_n$, and $x_i \cap x_{i+1} \ne \emptyset$ for i=0,...,n-1$\}$.

Notice that $\pi \cdot \pi'$ is the coarsest common refinement of $\pi,\pi'$ (in the sense of $\le$) and $\pi + \pi'$ is their finest common generalization. Also $\cdot$, + are associative, commutative and idempotent (cf. Section 5.3).
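To make the two operations concrete, here is a small Python sketch. It is an illustration under our own assumptions: a partition is represented as a set of frozensets, and the names `pprod`, `psum`, `refines` and `partition` are ours. The product collects the nonempty pairwise intersections of blocks, and the sum repeatedly merges blocks that overlap.

```python
def pprod(p, q):
    """Product: nonempty pairwise intersections (coarsest common refinement)."""
    return {x & y for x in p for y in q if x & y}

def psum(p, q):
    """Sum: merge blocks of p and q that overlap, transitively."""
    merged = []                      # pairwise disjoint blocks built so far
    for b in map(set, p | q):
        touching = [m for m in merged if m & b]
        merged = [m for m in merged if not (m & b)]
        merged.append(b.union(*touching))
    return {frozenset(b) for b in merged}

def refines(p, q):
    """p <= q: every block of p lies inside some block of q."""
    return all(any(x <= y for y in q) for x in p)

def partition(*blocks):
    """Convenience constructor for a partition given as a list of blocks."""
    return {frozenset(b) for b in blocks}

p = partition({1, 2}, {3, 4})
q = partition({1}, {2, 3}, {4})
print(pprod(p, q))   # every element ends up in its own block
print(psum(p, q))    # the chain 1-2-3-4 collapses into a single block
```

In this example the product refines both arguments and both arguments refine the sum, in line with the remark that the operations give the coarsest common refinement and the finest common generalization.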


With the above remarks, it is easy to see that an FD such as $AB \to CD$ holds in relation r iff
$\pi_A \cdot \pi_B \le \pi_C \cdot \pi_D$
or, equivalently,
$\pi_A \cdot \pi_B = \pi_A \cdot \pi_B \cdot \pi_C \cdot \pi_D$
or, still,
$\pi_C \cdot \pi_D = \pi_A \cdot \pi_B + \pi_C \cdot \pi_D$.

Thus, FD's can be expressed equationally using product and sum of partitions. It is then natural to investigate the expressive power of general equations one can write using $\cdot$, +.


Definition 5.2:
a. The set of partition expressions over U, W(U), is the least set satisfying the following closure conditions:
1. $A \in W(U)$, for A in U.
2. If $e,e' \in W(U)$, then $(e \cdot e')$, $(e + e')$ are in W(U).
($\cdot$, + are meant here as uninterpreted operator symbols)

b. A partition dependency (PD) is an equation e = e', where $e,e' \in W(U)$.

The above definition gives the syntax of PD's. The semantics of PD's are given below:


Definition 5.3:
a. Let r be a relation over U, S the set of tuples of r. For A in U,
$\pi_A = \{x \mid t,s \in S$ are in x iff $t[A]=s[A]\}$.
Then L(r) is the set obtained by closing $\{\pi_A \mid A \in U\}$ under product and sum of partitions.

b. Let $e \in W(U)$. The meaning of e in L(r), $\mu_r(e)$, is defined inductively as follows:
1. $\mu_r(A)=\pi_A$, A in U.
2. $\mu_r(e \cdot e')=\mu_r(e) \cdot \mu_r(e')$,
$\mu_r(e + e')=\mu_r(e) + \mu_r(e')$.

Relation r satisfies a PD e=e' (notation: $r \models e=e'$) iff $\mu_r(e)=\mu_r(e')$.

Observe that L(r) is actually a lattice [28], generated by the set $\{\pi_A \mid A \in U\}$. As a matter of fact, $r \models e=e'$ iff L(r) satisfies the equation e=e' (with A interpreted as $\pi_A$, $A \in U$).


From Definition 5.3, we see that we can use the formalism of PD's to express an FD $AB \to CD$ as the PD $A \cdot B = A \cdot B \cdot C \cdot D$. Clearly $r \models AB \to CD$ iff $r \models A \cdot B = A \cdot B \cdot C \cdot D$ (here and in the sequel we omit parentheses from PD's wherever possible, for the sake of clarity). Partition dependencies of the above form, which are equivalent to FD's, are of special interest; we call them FPD's.
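The semantics of Definition 5.3 can be prototyped directly. In the sketch below (illustrative only), a relation is a list of dictionaries, tuples are identified by their position in the list, and a partition expression is either an attribute name or a triple `(op, left, right)` with `op` equal to `'*'` or `'+'`; this encoding and all function names are assumptions of ours. The function `meaning` evaluates an expression to a partition of the tuple positions, and `satisfies_pd` compares the meanings of the two sides.

```python
def pprod(p, q):
    """Partition product: nonempty pairwise intersections of blocks."""
    return {x & y for x in p for y in q if x & y}

def psum(p, q):
    """Partition sum: transitive merging of overlapping blocks."""
    merged = []
    for b in map(set, p | q):
        touching = [m for m in merged if m & b]
        merged = [m for m in merged if not (m & b)]
        merged.append(b.union(*touching))
    return {frozenset(b) for b in merged}

def column_partition(rel, attr):
    """Group tuple positions by their value in column attr."""
    blocks = {}
    for i, t in enumerate(rel):
        blocks.setdefault(t[attr], set()).add(i)
    return {frozenset(b) for b in blocks.values()}

def meaning(rel, e):
    """Evaluate a partition expression over the relation rel."""
    if isinstance(e, str):
        return column_partition(rel, e)
    op, left, right = e
    l, r = meaning(rel, left), meaning(rel, right)
    return pprod(l, r) if op == '*' else psum(l, r)

def satisfies_pd(rel, e1, e2):
    """True iff rel satisfies the PD e1 = e2."""
    return meaning(rel, e1) == meaning(rel, e2)

# The FD AB -> CD written as the FPD  A*B = A*B*C*D.
ab = ('*', 'A', 'B')
abcd = ('*', ab, ('*', 'C', 'D'))
good = [dict(A=1, B=1, C=0, D=0, E='x'),
        dict(A=1, B=1, C=0, D=0, E='y'),
        dict(A=1, B=2, C=0, D=1, E='x')]
bad = [dict(A=1, B=1, C=0, D=0, E='x'),
       dict(A=1, B=1, C=0, D=1, E='y'),   # agrees on AB, differs on D
       dict(A=1, B=2, C=0, D=1, E='x')]
print(satisfies_pd(good, ab, abcd), satisfies_pd(bad, ab, abcd))
```

The extra attribute E is there only so that two distinct tuples can agree on A and B.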


In the remainder of this Chapter, we investigate various questions concerning PD’s. Section 5.2 
deals with the expressive power of PD’s, and compares PD’s to EID’s from this point of view. In 
Section 5.3 we give a polynomial-time algorithm for the implication problem for PD’s. Finally, in 


Section 5.4 we present a polynomial-time test for consistency of a database with a set of PD’s. 


5.2 Expressive Power 


We want to study what properties of a relation r can be expressed using sets of PD's. From the definitions of $\cdot$, + and Definition 5.3 it is easy to see the following:

1. $r \models C = A \cdot B$ iff for any tuples $t,s \in r$,
$t[C]=s[C]$ iff $t[A]=s[A]$ and $t[B]=s[B]$.

2. $r \models C = A + B$ iff for any tuples $t,s \in r$,
$t[C]=s[C]$ iff there is a sequence $s_0,\dots,s_n$ of tuples of r with $t=s_0$, $s_n=s$, and for i=0,...,n-1, $s_i[A]=s_{i+1}[A]$ or $s_i[B]=s_{i+1}[B]$.

From observation (2) above, one sees that symmetric transitive closure can be expressed by a PD, as follows:


Example 5.2: Consider a relation r representing an undirected graph. This relation has three attributes: HEAD, TAIL and COMPONENT. For every edge {a,b} in the graph we have in the relation tuples abc, bac, aac, bbc, where c is a number which could vary with {a,b}. These are the only tuples in r. We would like to express that: for each tuple t of r, t[COMPONENT] is the connected component in which the arc (t[HEAD], t[TAIL]) belongs. We can do this by insisting that r satisfies the PD

COMPONENT = HEAD + TAIL.
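Example 5.2 can be tried out on a toy graph. The sketch below is only an illustration; the helper names and the encoding of tuples as triples (HEAD, TAIL, COMPONENT) are our assumptions. It builds the relation from labelled edges, computes the sum of the HEAD and TAIL partitions with a union-find structure over tuple positions, and compares it with the partition induced by the COMPONENT column.

```python
HEAD, TAIL, COMPONENT = 0, 1, 2

def graph_relation(labelled_edges):
    """Tuples abc, bac, aac, bbc for every labelled edge (a, b, c)."""
    rel = set()
    for a, b, c in labelled_edges:
        rel |= {(a, b, c), (b, a, c), (a, a, c), (b, b, c)}
    return sorted(rel)

def find(parent, i):
    """Representative of i, with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def sum_partition(rel, attrs):
    """Sum of the column partitions of attrs, over tuple positions."""
    parent = list(range(len(rel)))
    for a in attrs:
        first = {}                       # first position seen for each value
        for i, t in enumerate(rel):
            j = first.setdefault(t[a], i)
            parent[find(parent, i)] = find(parent, j)
    blocks = {}
    for i in range(len(rel)):
        blocks.setdefault(find(parent, i), set()).add(i)
    return {frozenset(b) for b in blocks.values()}

def column_partition(rel, a):
    """Partition of tuple positions induced by column a."""
    return sum_partition(rel, [a])

def satisfies_component_pd(rel):
    """True iff rel satisfies COMPONENT = HEAD + TAIL."""
    return column_partition(rel, COMPONENT) == sum_partition(rel, [HEAD, TAIL])

good = graph_relation([(1, 2, 'X'), (2, 3, 'X'), (4, 5, 'Y')])
too_coarse = graph_relation([(1, 2, 'X'), (2, 3, 'X'), (4, 5, 'X')])
too_fine = graph_relation([(1, 2, 'X'), (2, 3, 'Z'), (4, 5, 'Y')])
print(satisfies_component_pd(good),
      satisfies_component_pd(too_coarse),
      satisfies_component_pd(too_fine))
```

The PD fails both when two connected components share a label and when one component carries two labels, which is exactly the "iff" in observation (2).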


We now want to compare the expressive power of PD's to that of previously studied database constraints, namely EID's [34]. Let us say that an EID $\sigma$ is expressed by a set E of PD's iff for any relation r, $r \models \sigma$ iff $r \models E$. From the algebraic properties of $\cdot$, the PD $C = A \cdot B$ is equivalent to $C = C \cdot A \cdot B \wedge A \cdot B = C \cdot A \cdot B$, and therefore it is expressed by the set $\{C \to AB,\ AB \to C\}$. However, because of Example 5.2 above it should come as no surprise [4] that the PD C=A+B cannot be expressed by any set of EID's:

Theorem 5.1: Let U=ABC; the PD C=A+B cannot be expressed by any set of first-order sentences.


Proof: Let $\Sigma$ be a set of first-order sentences (with a single ternary relation symbol R as the only non-logical symbol) which expresses C=A+B. For $k \ge 1$, let $\varphi_k$ be the following first-order formula, with free variables t,s:

"$t[C]=s[C]$ and there is no sequence $s_0,\dots,s_k$ such that $t=s_0$, $s_k=s$, and for i=0,...,k-1, $s_i[A]=s_{i+1}[A]$ or $s_i[B]=s_{i+1}[B]$"

(it is easy to see how to write $\varphi_k$ without tuple variables). Observe that the relation r in Figure 5-1 (with t,s as indicated) is a model for $\Sigma \cup \{\varphi_k\}$: $r \models C=A+B$ so $r \models \Sigma$, and clearly $r \models \varphi_k$. Thus, any finite subset of $\Sigma' = \Sigma \cup \{\varphi_k : k \ge 1\}$ has a model, and thus by the Compactness Theorem [32] $\Sigma'$ has a model, say r'. But this is a contradiction, since r' satisfies $\Sigma$ and thus r' satisfies C=A+B, and on the other hand $r' \models \varphi_k$ for all $k \ge 1$ and therefore it does not satisfy C=A+B. ∎

On the other hand, an EID as simple as an MVD cannot be expressed by PD's:

Theorem 5.2: Let U=ABC; the MVD $A \twoheadrightarrow B$ cannot be expressed by any set of PD's.


Proof: Let E be a set of PD's which expresses $A \twoheadrightarrow B$ (see Example 5.1 for the meaning of this MVD). Referring to Figure 5-2, relation $r_1$ satisfies $A \twoheadrightarrow B$, so $L(r_1) \models E$. On the other hand, relation $r_2$ does not satisfy $A \twoheadrightarrow B$, so $L(r_2)$ does not satisfy E. But this is a contradiction, because $L(r_1)$, $L(r_2)$ are isomorphic, and thus they satisfy exactly the same PD's. ∎


5.3 The Implication Problem 


Given a finite set E of PD's and a PD $\delta$, we want to know if $E \models \delta$, i.e. if $\delta$ holds in every relation that satisfies E. We also want to know if $E \models_{fin} \delta$, i.e. if $\delta$ holds in every finite relation that satisfies E. We first observe that these questions can be approached as implication problems for lattices.


Lemma 5.1:
a. $E \models \delta$ iff $E \models_{lat} \delta$, i.e. iff $\delta$ holds in every lattice that satisfies E.
b. $E \models_{fin} \delta$ iff $E \models_{lat,fin} \delta$, i.e. iff $\delta$ holds in every finite lattice that satisfies E.

Proof:
a. ($\Leftarrow$): Suppose $E \models_{lat} \delta$, and let r be a relation that satisfies E. Then $L(r) \models E$, so $\delta$ holds in L(r), and thus r satisfies $\delta$.

($\Rightarrow$): Suppose $E \models \delta$, and let L be a lattice satisfying E. By the Representation Theorem for lattices [28, 66], we may take the elements of L to be partitions of some set X. Thus, each A in U is interpreted in L as a partition $\pi_A$ of X (and, of course, $\cdot$, + in L are partition product and sum respectively). Now consider a relation r over U containing a tuple $t_i$ for each element i of X (these are the only tuples in r), where $t_i[A]=t_j[A]$ iff i,j are in the same block of $\pi_A$, A in U. Clearly r satisfies exactly the same PD's as L. Thus $r \models E$, so by the hypothesis $r \models \delta$, and therefore $L \models \delta$.


b. ($\Leftarrow$): Observe, in the proof of the "if" direction of (a), that if r is finite then L(r) is also finite.

($\Rightarrow$): Observe, in the proof of the "only if" direction of (a), that if L is finite then the set X can be taken to be finite, by the Representation Theorem for finite lattices [56]. Then the relation r is also finite. ∎

Now $E \models_{lat} \delta$ can be viewed as a (uniform) word problem, since a set with two binary operations $\cdot$, + is a lattice iff the following set of axioms (LA) is satisfied [28]:
1. $x+x=x$, $x \cdot x=x$ (idempotency)
2. $x+y=y+x$, $x \cdot y=y \cdot x$ (commutativity)
3. $x+(y+z)=(x+y)+z$, $x \cdot (y \cdot z)=(x \cdot y) \cdot z$ (associativity)
4. $x+(x \cdot y)=x$, $x \cdot (x+y)=x$ (absorption)

I.e., $E \models_{lat} \delta$ iff $\delta$ is implied from $E \cup LA$. We are going to show that $\models_{lat,fin}$ is equivalent to $\models_{lat}$, so $\models_{lat,fin}$ can also be viewed as a word problem.


In particular, let $\delta_\sigma$ be the FPD corresponding to an FD $\sigma$ ($\delta_\sigma$ is $A = A \cdot B$ if $\sigma$ is $A \to B$), and let $E_\Sigma$ be the set of FPD's corresponding to a set of FD's $\Sigma$. Since $r \models \sigma$ iff $r \models \delta_\sigma$, $\Sigma \models \sigma$ iff $E_\Sigma \models \delta_\sigma$. Thus, the implication problem for FD's can be reduced, in a straightforward way, to the (uniform) word problem for idempotent commutative semigroups (structures with a single associative, commutative and idempotent operator). On the other hand, since X = Y is equivalent to $X = X \cdot Y \wedge Y = Y \cdot X$, we can also reduce the above word problem to the implication problem for FD's.

We now present a polynomial-time algorithm for the (finite) implication problem for PD's. Suppose we are given a set E of PD's, and a PD e=e': by Lemma 5.1, it suffices to test if $E \models_{lat} e=e'$ ($E \models_{lat,fin} e=e'$).


Consider the set W(U) of partition expressions over U, $\cdot$, +: we define several binary relations on W(U). First, define $\le_{id}$ (identically less-than-or-equal) inductively as follows:
1. $A \le_{id} A$, A in U.
2. if $p \le_{id} r$, $q \le_{id} r$ then $p+q \le_{id} r$.
3. if $p \le_{id} r$ or $q \le_{id} r$ then $p \cdot q \le_{id} r$.
4. if $r \le_{id} p$, $r \le_{id} q$ then $r \le_{id} p \cdot q$.
5. if $r \le_{id} p$ or $r \le_{id} q$ then $r \le_{id} p+q$.
(The intended meaning of $\le_{id}$ is that $p \le_{id} q$ iff every lattice satisfies $p \le q$, no matter how the A's in U are interpreted).

The relation $\le_{id}$ is reflexive and transitive [28, 65]. Also, if $p_1 \le_{id} q_1$, $p_2 \le_{id} q_2$, then $p_1+p_2 \le_{id} q_1+q_2$ and $p_1 \cdot p_2 \le_{id} q_1 \cdot q_2$.

Now define $=_{id}$ as follows: $p =_{id} q$ iff both $p \le_{id} q$ and $q \le_{id} p$.

The relation $=_{id}$ is an equivalence relation, and in particular it is a congruence: i.e., if $p_1 =_{id} q_1$, $p_2 =_{id} q_2$, then $p_1+p_2 =_{id} q_1+q_2$ and $p_1 \cdot p_2 =_{id} q_1 \cdot q_2$. Thus, one can define $\cdot$, + on the set of equivalence classes of $=_{id}$. The structure obtained this way is a lattice [28, 65].


We now capture the effect of E. Define the following relation $\to_E$ on W(U): $p \to_E q$ iff q can be obtained from p as follows: for i=0,...,n, substitute $w_i$ for some (zero or more) occurrences of $z_i$, where $z_i = w_i$ ($w_i = z_i$) is in E. It is easily verified that $\to_E$ is a congruence.

Now define $\le_E$ as the sum of $\le_{id}$, $\to_E$: $p \le_E q$ iff there is a sequence of expressions $s_0,\dots,s_n$ such that $p=s_0$, $s_n=q$, and for i=0,...,n-1, $s_i \le_{id} s_{i+1}$ or $s_i \to_E s_{i+1}$.

It is easy to see that $\le_E$ is reflexive and transitive. Also if $p_1 \le_E q_1$, $p_2 \le_E q_2$, then $p_1+p_2 \le_E q_1+q_2$ and $p_1 \cdot p_2 \le_E q_1 \cdot q_2$ (because both $\le_{id}$ and $\to_E$ have this property [36]).

Finally, define $=_E$ as follows: $p =_E q$ iff both $p \le_E q$ and $q \le_E p$.

The relation $=_E$ is an equivalence relation, and moreover it is a congruence. One can further observe that the equivalence classes of $=_E$ form a lattice $L_E$ under the induced $\cdot$, +: just check the axioms LA, e.g. $p+p =_E p$ because $p+p =_{id} p$, and in general if $p =_{id} q$ then $p =_E q$. Note that $L_E$ satisfies a PD p=q iff $p =_E q$ ($A \in U$ is interpreted in $L_E$ as the equivalence class of A).


We now show that the relation $=_E$ captures the PD's (finitely) implied by E:

Lemma 5.2: The following statements are equivalent:
a. $e =_E e'$
b. $E \models_{lat} e=e'$
c. $E \models_{lat,fin} e=e'$

Proof: Observe that, from the way $\le_{id}$ and $\le_E$ were defined, if $e \le_E e'$ then $e \le e'$ in every lattice satisfying E (where $\le$ is the partial order of the lattice). Thus, (a) $\Rightarrow$ (b). To prove (b) $\Rightarrow$ (a), recall that $L_E$ satisfies a PD p=q iff $p =_E q$. Thus, if $e \ne_E e'$ then $L_E$ does not satisfy e=e', whereas it satisfies E; i.e., $L_E$ is a counterexample to $E \models_{lat} e=e'$.

We now show the equivalence of (b),(c). The direction (b) $\Rightarrow$ (c) is obvious. To prove the converse, we adapt an argument of [30] (see also [28]), originally given for the special case $E = \emptyset$.

Suppose E does not imply e=e': we will show that there is a finite lattice which satisfies E but violates e=e'. Let $\{A_i \mid i=1,\dots,n\}$ be the set of attributes appearing in E,e,e', and let V be the set of all partition expressions (over the $A_i$'s) of complexity at most as high as the maximum complexity of e,e' and the expressions in E (complexity can be measured by the number of instances of $\cdot$, +). Note that V is finite, since E is finite.

Consider now the subset L of $L_E$ consisting of all finite products of the equivalence classes (under $=_E$) of elements of V, together with the equivalence class of $A_1+\dots+A_n$. It is not hard to verify that L is a sublattice of $L_E$. But by the equivalence of (a),(b) $e \ne_E e'$, so L satisfies E and violates e=e'. Since L is also obviously finite, we are done. ∎
We can now prove our main result: 
Theorem 5.3: There is a polynomial-time algorithm for the (finite) implication problem for PD’s. 


Proof: By Lemmas 5.1, 5.2, it is sufficient to describe a polynomial-time algorithm to test, given E,e,e', whether $e \le_E e'$.

Let V be the set of all subexpressions of e,e', and of the expressions appearing in E. The following algorithm constructs a set $\Gamma$ of directed arcs over V such that $p \le_E q$ whenever $(p,q) \in \Gamma$:

begin
    $\Gamma \leftarrow \emptyset$
    repeat until no new arcs are added
        1. add $(A,A)$, $A \in U$
        2. if $(p,r) \in \Gamma$, $(q,r) \in \Gamma$, $p+q \in V$
           then add $(p+q,r)$
        3. if $(p,r) \in \Gamma$ or $(q,r) \in \Gamma$, $p \cdot q \in V$
           then add $(p \cdot q,r)$
        4. if $(r,p) \in \Gamma$, $(r,q) \in \Gamma$, $p \cdot q \in V$
           then add $(r,p \cdot q)$
        5. if $(r,p) \in \Gamma$ or $(r,q) \in \Gamma$, $p+q \in V$
           then add $(r,p+q)$
        6. add $(z,w)$, $(w,z)$, where z=w in E
        7. if $(p,r) \in \Gamma$, $(r,q) \in \Gamma$
           then add $(p,q)$
    end
end
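The arc-closure procedure above can be prototyped in a few lines of Python. This is a sketch under stated assumptions rather than a tuned implementation: expressions are encoded as attribute names or triples `(op, left, right)` with `op` in `{'*', '+'}`, E is a list of pairs of expressions, step 1 is restricted to the attributes that actually occur in V, and the names `subexpressions` and `implies` are ours. The loop recomputes all candidate arcs on every round, which is polynomial but deliberately naive.

```python
def subexpressions(e, acc):
    """Add e and all of its subexpressions to the set acc."""
    acc.add(e)
    if isinstance(e, tuple):
        subexpressions(e[1], acc)
        subexpressions(e[2], acc)
    return acc

def implies(E, e1, e2):
    """Decide whether the PDs in E imply e1 = e2, by closing the arc set."""
    V = set()
    for l, r in list(E) + [(e1, e2)]:
        subexpressions(l, V)
        subexpressions(r, V)
    gamma = {(a, a) for a in V if isinstance(a, str)}          # step 1
    for z, w in E:                                             # step 6
        gamma |= {(z, w), (w, z)}
    compound = [e for e in V if isinstance(e, tuple)]
    changed = True
    while changed:
        new = set()
        for e in compound:
            op, p, q = e
            for r in V:
                if op == '+':
                    if (p, r) in gamma and (q, r) in gamma:    # step 2
                        new.add((e, r))
                    if (r, p) in gamma or (r, q) in gamma:     # step 5
                        new.add((r, e))
                else:
                    if (p, r) in gamma or (q, r) in gamma:     # step 3
                        new.add((e, r))
                    if (r, p) in gamma and (r, q) in gamma:    # step 4
                        new.add((r, e))
        for p, r in gamma:                                     # step 7
            for s in V:
                if (r, s) in gamma:
                    new.add((p, s))
        changed = not new <= gamma
        gamma |= new
    return (e1, e2) in gamma and (e2, e1) in gamma

ab, bc, ac, ca = ('*', 'A', 'B'), ('*', 'B', 'C'), ('*', 'A', 'C'), ('*', 'C', 'A')
fds = [('A', ab), ('B', bc)]           # the FPDs for A -> B and B -> C
print(implies(fds, 'A', ac))           # transitivity: A -> C follows
print(implies(fds, 'C', ca))           # C -> A does not follow
```

With E empty the same function tests lattice identities: absorption is recognised, while the distributive law, which fails in some lattices, is rejected.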
Observe that Steps 1-5 in the above algorithm mirror the definition of $\le_{id}$.

We will now prove the following

Claim: For $p,q \in V$, $p \le_E q$ iff $(p,q) \in \Gamma$.

Clearly, the Theorem follows from the Claim: to test if $e \le_E e'$, construct the digraph $(V,\Gamma)$ and check if it has an arc from e to e'. This can be done in polynomial time.

Proof of Claim:
($\Leftarrow$): Straightforward.

($\Rightarrow$): We first give a set of rewrite rules [41] for $\le_E$:

1. $x+x \to x$
2. $x \cdot y \to x$
3. $y \cdot x \to x$
4. $x \to x \cdot x$
5. $x \to x+y$
6. $x \to y+x$
7. $z \to w$, where z=w (w=z) is in E

Observe, regarding Rules 5,6, that y can be an arbitrary expression.

An easy induction shows that, if $p \le_{id} q$, then p can be rewritten as q using Rules 1-6. By the definition of $\le_E$, if $p \le_E q$ then there is a sequence of expressions $s_0,\dots,s_n$ such that $p=s_0$, $s_n=q$, and for i=0,...,n-1, $s_i \to s_{i+1}$, i.e. $s_{i+1}$ is obtained from $s_i$ by rewriting a subexpression of $s_i$ according to one of the Rules 1-7. We call such a sequence a proof that $p \le_E q$.
Now we define a relation $\prec$ on pairs of expressions:

$(p_1,q_1) \prec (p_2,q_2)$ iff $p_1 \le_E q_1$, $p_2 \le_E q_2$, and either
(i) the shortest proof that $p_1 \le_E q_1$ is shorter than the shortest proof that $p_2 \le_E q_2$, or
(ii) the shortest proofs that $p_1 \le_E q_1$, $p_2 \le_E q_2$ have the same length, and $p_1$ is a proper subexpression of $p_2$, $q_1$ is a proper subexpression of $q_2$.

Clearly $\prec$ is well-founded. We proceed by induction on $\prec$.

Basis: There is a proof that $p \le_E q$ of length 0. Then p is identical to q, and $(p,q) \in \Gamma$.

Induction Step: Let $p,q \in V$, and assume that the Claim holds for $p',q' \in V$ whenever $(p',q') \prec (p,q)$. We will show that the Claim holds for (p,q). Let $s_0,\dots,s_n$, n>0, be a shortest proof that $p \le_E q$.

Case 1: For i=0,...,n-1, $s_{i+1}$ is obtained from $s_i$ by rewriting a proper subexpression of $s_i$ according to Rules 1-7. Then $p=p_1 \odot p_2$, $q=q_1 \odot q_2$ ($\odot \in \{\cdot,+\}$), where $p_i \le_E q_i$ via proofs at most as long as the proof that $p \le_E q$, and $p_i$ ($q_i$) is a proper subexpression of p (q). Thus $(p_i,q_i) \prec (p,q)$, and furthermore $p_i,q_i \in V$, so by the induction hypothesis $(p_i,q_i) \in \Gamma$. It then easily follows that $(p,q) \in \Gamma$.


Case 2: For some i, $0 \le i \le n-1$, the whole expression $s_i$ is rewritten into $s_{i+1}$ according to one of the Rules 1-7.

Case 2a: For some i as above, the Rule used is Rule 7. This means p is rewritten to z, z=w (w=z) is in E, and w is rewritten to q. Then clearly $(p,z) \prec (p,q)$, and since $z \in V$, by the induction hypothesis $(p,z) \in \Gamma$. Similarly $(w,q) \in \Gamma$. It follows that $(p,q) \in \Gamma$.

Case 2b: For any i as above, the Rule used is one of the Rules 1-6. We consider the least such i, and we distinguish cases according to which Rule was used to rewrite $s_i$ to $s_{i+1}$.

Rule 1 This means $p=p_1+p_2$, $p_1$ rewrites to r, $p_2$ rewrites to r, and r rewrites to q. Then $p_j \le_E q$ via proofs shorter than the proof that $p \le_E q$, so $(p_j,q) \prec (p,q)$. Also $p_j \in V$, so by the induction hypothesis $(p_j,q) \in \Gamma$. It follows that $(p,q) \in \Gamma$.

Rule 2 This means $p=p_1 \cdot p_2$, $p_1$ rewrites to r, r rewrites to q. Then $p_1 \le_E q$ via a proof shorter than the proof that $p \le_E q$, so $(p_1,q) \prec (p,q)$. Also $p_1 \in V$, so by the induction hypothesis $(p_1,q) \in \Gamma$. It follows that $(p,q) \in \Gamma$.


Rule 3 Similar to Rule 2. 


Rule 4 Now p rewrites to r, and Rule 4 rewrites r to $r \cdot r$. Observe that the expression $r \cdot r$ will not be rewritten subsequently using Rules 2,3, because in that case we could shorten the proof that $p \le_E q$ (however, either subexpression of $r \cdot r$ may be rewritten). Moreover, if at some later point Rule 5 is applied to rewrite the whole expression $s_i$ as $s_i+y$, then $s_i+y$ will not be rewritten subsequently using Rule 1. Thus, the expression q eventually obtained is built up, using Rules 4,5,6, by some expressions $r_j$, j=1,...,m, such that r rewrites to $r_j$ for all j, and by some completely new expressions $y_k$, k=1,...,m', which were introduced by Rules 5,6. Now clearly $(p,r_j) \prec (p,q)$ and $r_j \in V$, so by the induction hypothesis $(p,r_j) \in \Gamma$. It then follows by an easy induction on the structure of q that $(p,q) \in \Gamma$.


Rules 5,6 Similar to Rule 4. 


This concludes the Proof of the Claim, so we are done. ∎

Since inference of FD's can be seen as a special case of inference of PD's, the problem is actually polynomial-time complete [63]. However, in the special case where E is empty [28, 65] it can be solved in logarithmic space [40], as we now outline. By Lemma 5.2, it suffices to describe how to recognize $\le_{id}$ in logarithmic space.

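First, observe the following:

1. $A \le_{id} A'$ iff A is identical to A', A,A' in U.

2. $A \le_{id} p' \cdot q'$ iff $A \le_{id} p'$ and $A \le_{id} q'$, A in U.

3. $A \le_{id} p'+q'$ iff $A \le_{id} p'$ or $A \le_{id} q'$, A in U.

4. $p \cdot q \le_{id} A'$ iff $p \le_{id} A'$ or $q \le_{id} A'$, A' in U.

5. $p \cdot q \le_{id} p' \cdot q'$ iff $p \cdot q \le_{id} p'$ and $p \cdot q \le_{id} q'$.

6. $p \cdot q \le_{id} p'+q'$ iff $p \le_{id} p'+q'$ or $q \le_{id} p'+q'$ or $p \cdot q \le_{id} p'$ or $p \cdot q \le_{id} q'$.

7. $p+q \le_{id} e'$ iff $p \le_{id} e'$ and $q \le_{id} e'$.

In each of the above cases, the "if" direction is trivial. The "only-if" direction follows in Case 5 because $p' \cdot q' \le_{id} p'$ and $p' \cdot q' \le_{id} q'$, and in Case 7 because $p \le_{id} p+q$, $q \le_{id} p+q$. In the remaining cases, the "only-if" direction follows by the definition of $\le_{id}$.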
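The seven cases translate almost literally into a recursive test. The Python sketch below is illustrative only: it uses the ordinary call stack and therefore does not achieve the logarithmic space bound; the encoding of expressions as attribute names or triples `(op, left, right)` and the name `leq_id` are our assumptions.

```python
def leq_id(e, f):
    """Decide e <=_id f by the case analysis on the shapes of e and f."""
    if isinstance(e, tuple) and e[0] == '+':              # case 7
        return leq_id(e[1], f) and leq_id(e[2], f)
    if isinstance(f, tuple) and f[0] == '*':              # cases 2 and 5
        return leq_id(e, f[1]) and leq_id(e, f[2])
    if isinstance(e, str):
        if isinstance(f, str):                            # case 1
            return e == f
        return leq_id(e, f[1]) or leq_id(e, f[2])         # case 3 (f is a sum)
    # From here on e is a product.
    if isinstance(f, str):                                # case 4
        return leq_id(e[1], f) or leq_id(e[2], f)
    return (leq_id(e[1], f) or leq_id(e[2], f)            # case 6
            or leq_id(e, f[1]) or leq_id(e, f[2]))

ab, ac = ('*', 'A', 'B'), ('*', 'A', 'C')
small = ('+', ab, ac)                  # A*B + A*C
large = ('*', 'A', ('+', 'B', 'C'))    # A*(B + C)
print(leq_id(small, large), leq_id(large, small))
```

The example exhibits the one-sided distributive inequality: $A \cdot B + A \cdot C \le_{id} A \cdot (B+C)$ holds in every lattice, while the converse does not.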


The above observation gives a recursive algorithm to test, given e,e', whether $e \le_{id} e'$. We now describe how to implement this recursion using only logarithmic auxiliary space.

First, note that the results of intermediate recursive calls need not be stored. For example, consider Case 7: if the recursive call for $p \le_{id} e'$ returns false, then we immediately return false; otherwise, we return the result of the recursive call for $q \le_{id} e'$.

We will also argue that we do not need to store the arguments of previous recursive calls. Thus, all we need to have in storage at any particular point is the arguments of the recursive call which is being evaluated. Since these arguments are subexpressions of e,e', we can just have two pointers to the appropriate places in the input, and this only takes logarithmic space.


We will now describe how, given two pointers to two subexpressions p,p' of e,e' respectively, we
can find the next recursive call to be evaluated, using only logarithmic additional space. We assume
that e,e' are represented (in the standard way) as binary trees, so that, given a pointer to a node u, we
can find a pointer to the father (right son, left son) of u.

We use two auxiliary pointers a,a', initialized to the root of e,e' respectively. Let C(e,e') be the set of
recursive calls generated from the call e ≤_id e' (C(e,e') contains either two or four members,
depending on which of Cases 2-7 is the relevant one). We will show that we can determine which
member of C(e,e') eventually gives rise to the call p ≤_id p', using only logarithmic additional space. If
this member of C(e,e') turns out to be the call e_1 ≤_id e_1', we set the pointers a,a' to the expressions
e_1,e_1' respectively and we repeat with C(e_1,e_1'). Continuing in this way, we will eventually find e_i,e_i'
such that the call p ≤_id p' is in C(e_i,e_i'). We can then easily determine the next call to be evaluated.


Finally, note that, to determine which member of C(e,e') eventually gives rise to the call p ≤_id p',
we only need to know whether p (p') is in the left or in the right subtree of e (e'). This can be found
by walking the tree representing e in a depth-first fashion, until we encounter p. This walk can be
done using only logarithmic additional space, because all we need to remember is the node v which is
currently visited and the node w which was visited immediately before v: if w is the father of v, we
next visit the left son of v; if w is the left son of v, we next visit the right son of v; if w is the right son
of v, we next visit the father of v.
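The two-pointer walk can be sketched in Python as follows. The Node class and the names walk and side_of are hypothetical, introduced only for illustration. The text does not say what happens at a leaf; the sketch assumes that a leaf reached from its father is left immediately by returning to the father.

```python
class Node:
    """Binary expression tree node with father/son pointers (assumed layout)."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right, self.parent = label, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self


def walk(root):
    """Yield each node on its first visit, remembering only v (cur) and w (prev)."""
    prev, cur = None, root
    while cur is not None:
        if prev is cur.parent:          # w is the father of v
            yield cur
            if cur.left is not None:
                prev, cur = cur, cur.left
            else:                       # assumption: a leaf bounces back up
                prev, cur = cur, cur.parent
        elif prev is cur.left:          # w is the left son of v
            prev, cur = cur, cur.right
        else:                           # w is the right son of v
            prev, cur = cur, cur.parent


def side_of(root, target):
    """Return "left" or "right" for the subtree of root that contains target."""
    side = None                         # one extra bit of state
    for node in walk(root):
        if node is root.left:
            side = "left"
        elif node is root.right:
            side = "right"
        if node is target:
            return side
    return None


# The expression (A*B)+C as a tree.
a, b, c = Node("A"), Node("B"), Node("C")
root = Node("+", Node("*", a, b), c)
print([n.label for n in walk(root)], side_of(root, b), side_of(root, c))
```

Apart from the two node pointers and one flag, nothing is stored, which is the point of the argument above.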


5.4 Testing Satisfaction 


Given a database d over U and a set of PD's E, we want to test if d is consistent with E, i.e., if there
is a weak instance w for d satisfying E. Recall that a relation w over U is a weak instance for d iff
every tuple of relation R[U] of d appears in the projection of w on U. Weak instances have been
proposed as a way to model incomplete information in databases [38, 64]. Given a database d and a
set of FD's E, we can test if d has a weak instance satisfying E in polynomial time [38]. We now show
how this test can be generalized to arbitrary PD's.
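For the FD case, the text cites [38] without spelling the test out. The Python sketch below assumes the usual chase-style procedure: pad every tuple to the full universe with distinct nulls, equate values as the FD's demand, and reject when two distinct constants are forced to be equal. The function name and the input encoding are assumptions made for this illustration, and the procedure of [38] may differ in its details.

```python
import itertools


def consistent_with_fds(universe, relations, fds):
    """relations: list of (attrs, tuples); fds: list of (lhs_attrs, rhs_attr)."""
    fresh = itertools.count()
    rows = []
    for attrs, tuples in relations:
        for t in tuples:
            known = dict(zip(attrs, t))
            # Known values become constants, missing ones distinct nulls.
            rows.append({a: ("const", known[a]) if a in known
                         else ("null", next(fresh)) for a in universe})

    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    def union(u, v):
        """Merge two classes; return (ok, changed)."""
        ru, rv = find(u), find(v)
        if ru == rv:
            return True, False
        if ru[0] == "const" and rv[0] == "const":
            return False, False             # two distinct constants clash
        if ru[0] == "const":
            parent[rv] = ru                 # keep the constant as the root
        else:
            parent[ru] = rv
        return True, True

    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1, r2 in itertools.combinations(rows, 2):
                if all(find(r1[a]) == find(r2[a]) for a in lhs):
                    ok, merged = union(r1[rhs], r2[rhs])
                    if not ok:
                        return False
                    changed = changed or merged
    return True


U = ["A", "B", "C"]
print(consistent_with_fds(U, [(("A", "B"), [(1, 2)]), (("A", "C"), [(1, 5)])],
                          [(("A",), "B")]))
```

Each pass either merges two classes or terminates, so the number of passes is bounded by the number of values in the padded table.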


First, we replace E by a set E' of PD's of the form C=A*B or C=A+B, where A,B,C are
attributes from a universe U' containing U; this is done by (recursively) replacing X = Y*Z by the
PD's X=C, Y=A, Z=B, C=A*B, where A,B,C are new attribute names. It is easy to check that
there is a weak instance for d satisfying E iff there is a weak instance for d satisfying E'.
Let us denote by p→q, where p,q are partition expressions, the PD p=p*q. This slight abuse of
notation is consistent, since the FPD X→Y is actually equivalent to the FD X→Y. Now a PD
C=A*B in E' can be replaced by the FPD's C→AB, AB→C, and a PD C=A+B in E' can be
replaced by the PD's A+B→C, C→A+B. Furthermore, the PD A+B→C can be replaced by the
FPD's A→C, B→C. We now have a set F consisting of FPD's and of PD's of the form C→A+B, and
it is obvious that there is a weak instance for d satisfying E' iff there is a weak instance for d satisfying
F.
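The rewriting of E' into F is mechanical, as the following Python sketch suggests. The encoding is an assumption made for illustration: a PD C=A*B or C=A+B is the tuple (C, op, A, B), an FPD is a pair (left-hand attributes, right-hand attributes), and a PD of the form C→A+B is the triple (C, A, B). The name to_F is hypothetical.

```python
def to_F(e_prime):
    """Split each PD of E' into FPD's and PD's of the form C -> A+B."""
    fpds, sums = [], []
    for c, op, a, b in e_prime:
        if op == "*":
            # C = A*B  becomes  C -> AB  and  AB -> C.
            fpds += [((c,), (a, b)), ((a, b), (c,))]
        elif op == "+":
            # C = A+B  becomes  A -> C,  B -> C  and  C -> A+B.
            fpds += [((a,), (c,)), ((b,), (c,))]
            sums.append((c, a, b))
        else:
            raise ValueError(f"unknown operator {op!r}")
    return fpds, sums


fpds, sums = to_F([("C", "*", "A", "B"), ("D", "+", "A", "B")])
print(fpds)
print(sums)
```

Only the sum PD's survive as non-functional constraints; everything else in F is an ordinary FD.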


Now compute (using the algorithm of the previous Section) all consequences of F of the form
A→B, A,B in U', and add them to F. Furthermore, if now F contains A→B and C→A+B, replace
C→A+B by C→B. Let F' be the set of FPD's in F. The crucial fact is given in the following


Lemma 5.3: There is a weak instance for d satisfying F iff there is a weak instance for d satisfying
F'.
Proof: The "only if" direction is obvious. For the converse, let w be a weak instance for d
satisfying F'. Suppose some PD C→A+B in F is violated by tuples t_1,t_2 of w, where t_1[ABC]=a_1b_1c,
t_2[ABC]=a_2b_2c, a_1≠a_2, b_1≠b_2. We can remedy this violation by adding to w a tuple s such that
s[AB]=a_1b_2. To make sure that the relation w_1 obtained still satisfies F', let A^+ = {X | F' ⊨ A→X},
B^+ = {X | F' ⊨ B→X}: we make s[A^+]=t_1[A^+], s[B^+]=t_2[B^+], and fill in the rest of the attributes of
s with distinct new values (not appearing in w). To argue that this is indeed possible, observe first that
B is not in A^+ and A is not in B^+ (otherwise C→A+B would not appear in F). We also have to
make sure that, if Q∈A^+ and Q∈B^+, then t_1[Q]=t_2[Q]. But if Q appears in both A^+ and B^+ we
have F' ⊨ A→Q, F' ⊨ B→Q, so since C→A+B is in F we have F ⊨ C→Q, and therefore C→Q is in
F'. This implies that t_1[Q]=t_2[Q], since t_1[C]=t_2[C] and w satisfies F'.


We now repeat the above argument, starting with w_1, to obtain relations w_2, w_3 and so on. The
relation obtained after an infinite number of steps is a weak instance for d satisfying E', because
any violation of some PD C→A+B appearing at any stage has been taken care of at some later stage.


We can now prove the main result: 


Theorem 5.4: There is a polynomial-time algorithm to test whether a given database d is consistent
with a set E of PD's.
Proof: Using the polynomial-time algorithm for inference of PD's given in Section 5.3, we can
construct the set F'. By Lemma 5.3, it then suffices to test whether d has a weak instance satisfying
F', which can be done in polynomial time [38].
Figure 5-2: MVD's are not expressible by PD's


Chapter Six 


Directions for Further Investigation 


Extending the Equational Approach


Of course, the most obvious question is whether our equational formulation of FD's and IND's
can be extended to more general dependencies. We outline some partial results we have at this point,
which indicate that such an extension is indeed possible.


Recall that an embedded implicational dependency (EID) is a typed sentence of the form

∀x_1...x_n. (φ_1 ∧ ... ∧ φ_m) ⇒ ∃y_1...y_q. (ψ_1 ∧ ... ∧ ψ_r)

where each φ_i is a relational formula, each ψ_i is either a relational formula or an equality between
two of the x_i's, and each of the x_i's appears in one of the φ_i's (cf. Section 5.1). If all the ψ_i's are
relational formulas, we have a tuple generating dependency (TGD); if all the ψ_i's are equalities, we
have an equality generating dependency (EGD) [10, 11, 34].


Every EID is obviously equivalent to the conjunction of a TGD and an EGD. Furthermore, it can
be shown that every EGD is equivalent to a conjunction of FD's and TGD's [11]. The question then
is whether we can have an equational formulation of FD's and TGD's.


Let U={A,B,C} and consider the MVD A→→B (cf. Example 5.1). We can formulate it as the
sentence

∀x_1 x_2. [a(x_1) = a(x_2) ⇒ ∃y. (a(y) = a(x_1) ∧ b(y) = b(x_1) ∧ c(y) = c(x_2))].

Here x_1, x_2, y are variables ranging over tuples; see Section 1.3. Now Skolemization suggests
transforming this MVD into an equational implication

a x_1 = a x_2 ⇒ (a i x_1 x_2 = a x_1 ∧ b i x_1 x_2 = b x_1 ∧ c i x_1 x_2 = c x_2).

In this way, we can transform any TGD into an equational implication. In fact, we can even relax the
typedness restriction, to obtain a class of constraints which properly includes IND's: specifically, it
suffices if only the part of the sentence consisting of the φ_i's is typed.


We can go even further and transform these equational implications into equations. We illustrate
how this is done with the implication

a x_1 = a x_2 ⇒ a i x_1 x_2 = a x_1.

This can be transformed into the set of equations

a i x_1 x_2 = f_1 x_1 x_2 (a x_1) (a x_2)
f_1 x_1 x_2 x x = a x_1,

where f_1 is a new function symbol of arity 4.
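As a check added here for illustration, the two equations do yield the original implication. Writing f_1 for the new function symbol and assuming a x_1 = a x_2, the chain below uses nothing beyond the two equations and that hypothesis:

```latex
\begin{align*}
a\,i\,x_1 x_2 &= f_1\, x_1\, x_2\, (a x_1)\, (a x_2) && \text{first equation} \\
              &= f_1\, x_1\, x_2\, (a x_1)\, (a x_1) && \text{hypothesis } a x_1 = a x_2 \\
              &= a x_1                               && \text{second equation, with } x := a x_1
\end{align*}
```

The last two arguments of f_1 act as a test for equality: the second equation applies only when they coincide, which is exactly the premise of the implication.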


The above equational formulation of TGD's can be used to prove a generalization of Theorem
2.1, for implication of TGD's from FD's and TGD's (i.e., we actually generalize the IND Case of
Theorem 2.1). The proof uses the same ideas as the proof of Theorem 2.1. Unfortunately, the proof of
the FD Case does not generalize, because the inductive argument for the completeness part depends
critically on the fact that Skolem functions have only one argument (which only happens in the case
of IND's).
Designing Normal Form Schemas 


An active area of research in logical database design is concerned with canonical representations
of the database schema, which avoid potential update anomalies (i.e., updates that can result in
inconsistent data), and minimize data redundancy. Several such representations have been proposed
and analyzed, assuming that the only integrity constraints of the database schema are FD's. The
general idea is that the database schema should be in a certain normal form [22, 7, 62, 51], i.e., certain
restrictive conditions should be satisfied by the FD's of the schema and their logical consequences.
Given a universe U of attributes and a finite set Σ of FD's, one can construct a database schema
satisfying such restrictions [12, 6]. These algorithms are based on efficient solutions of the implication
problem.


An interesting question is to investigate normal forms in the presence of FD’s and IND’s (cf. [33]). 
Eventually one would hope to extend the known schema synthesis algorithms to incorporate IND’s of 
some restricted form (for example, unary IND’s). The insights we have gained on the implication 


problem can potentially be useful for this investigation. 
Query Equivalence in the Presence of IND’s 


The problem of optimizing queries has received a lot of attention, because of its central role in all
relational database implementations [62]. Given a query Q, the goal is to design an equivalent query
Q' which can be processed as efficiently as possible (i.e., contains a minimum number of instances of
expensive operators, such as join). Since equivalence of two queries is a data dependency, the
problem of testing equivalence of queries in the presence of dependencies can be approached with
the standard tools for implication problems [3, 18, 62].


The equivalence of relational database queries in the presence of FD’s and IND’s has been 
examined in [43, 48], essentially by extending classical techniques (namely the chase). The authors of 
[43] show that under reasonable restrictions on the IND’s, query equivalence can be reduced to well- 
understood cases involving only FD’s. The approach of [48] is to introduce the weak instance 
assumption [38, 64]; under this restriction, query equivalence in the presence of FD’s and typed 


IND’s can be handled by the methods of [43]. 


Many questions remain unanswered in the area and new techniques seem to be required to handle
major new cases. The techniques we have developed for FD and IND implication may be useful in
this respect. In particular, it would be interesting to see if the tools we provide for typed IND's can
be used to study equivalence of (typed) conjunctive queries [18, 43] in the presence of typed IND's
and FD's, without the weak instance assumption of [48].
Expressing Data Distribution 


An important consideration in the context of distributed databases is to find ways to preprocess
relations stored at different sites, so that a given query can be processed with a minimum amount of
data communication between sites. Some work has already been done on characterizing database
schemes and queries for which such preprocessing is possible [8, 13]. An interesting research direction
is to extend these results to allow for the presence of FD's (conceivably we will be able to preprocess
more queries if the database is constrained to satisfy a set of FD's). Since data distribution can be
modeled by IND's, these questions can be approached as implication problems involving FD's and
IND's.
Performance of Equational Theorem Provers 


An interesting practical question is how well theorem provers designed around the Knuth-Bendix
method [46] perform on sets of equations obtained from database constraints. We have experimented
with the REVE system [35, 49], which has been able to handle various non-trivial inferences of FD's
and IND's. However, more work needs to be done in this direction.